Abstract. In this work we suggest a new model for generating random satisfiable $k$-CNF formulas.
To generate such formulas, randomly permute all $2^k\binom{n}{k}$ possible clauses over the variables
$x_1, \ldots, x_n$, and starting from the empty formula, go over the clauses one by one, including each
new clause as you go along if after its addition the formula remains satisfiable. We study the
evolution of this process, namely the distribution over formulas obtained after scanning through the
first $m$ clauses (in the random permutation's order).

Random processes with conditioning on a certain property being respected are widely studied in the
context of graph properties. This study was pioneered by Ruciński and Wormald in 1992 for graphs with
a fixed degree sequence, and also by Erdős, Suen, and Winkler in 1995 for triangle-free and bipartite
graphs. Since then many other graph properties were studied such as planarity and $H$-freeness. Thus
our model is a natural extension of this approach to the satisfiability setting.

Our main contribution is as follows. For $m > cn$, $c = c(k)$ a sufficiently large constant, we are able
to characterize the structure of the solution space of a typical formula in this distribution.
Specifically, we show that typically all satisfying assignments are essentially clustered in one
cluster, and all but $e^{-\Omega(m/n)}n$ of the variables take the same value in all satisfying
assignments. We also describe a polynomial time algorithm that with high probability finds a
satisfying assignment for such formulas.




Research supported in part by a USA-Israel BSF Grant, and by a grant from the Israel Science Foundation, and 

by Pazy Memorial Award. 

Research supported in part by NSF CAREER award DMS-0546523 and a USA-Israeli BSF grant. 



On the random satisfiable process

1 Introduction 

Constraint satisfaction problems play an important role in many areas of computer science, e.g.
computational complexity theory [10], coding theory [16], and artificial intelligence [24], to mention
just a few. The main challenge is to devise efficient algorithms for finding satisfying assignments
(when such exist), or conversely to provide a certificate of unsatisfiability. One of the best known
examples of a constraint satisfaction problem is $k$-SAT, which was the first problem to be proven
NP-complete. Although satisfactory approximation algorithms are known for several NP-hard problems,
the problem of finding a satisfying assignment (if such exists) is not amongst them. In fact, Håstad
[17] proved that it is NP-hard to approximate MAX-3SAT (the problem of finding an assignment
that satisfies as many clauses as possible) within a ratio better than 7/8.

In trying to understand the inherent hardness of the problem, many researchers analyzed structural
properties of formulas drawn from different distributions. One such distribution is the uniform
distribution where instances are generated by picking $m$ clauses uniformly at random out
of all $2^k\binom{n}{k}$ possible clauses. Although many problems still remain unsolved, in general this
distribution seems to be quite well understood (at least for some values of $m$ and $k$). This is also
true for the planted $k$-SAT model, where one first fixes some assignment $\varphi$ to the variables and
then picks $m$ clauses uniformly at random out of all $(2^k - 1)\binom{n}{k}$ clauses satisfied by
$\varphi$. Comparatively, much less is known for variants of these distributions where extra conditions
are imposed. These conditions distort the randomness in such a way that the "standard" methods and
tools employed to analyze the original distributions are a priori of little use in the new setting.
Our work concerns the latter.

1.1 Our Contribution 

In this work we suggest a new model for generating random satisfiable $k$-CNF formulas. To generate
such formulas, randomly permute all $2^k\binom{n}{k}$ possible clauses over the variables
$x_1, \ldots, x_n$, and starting from the empty formula, go over the clauses one by one, including each
new clause as you go along if after its addition to the formula, the formula remains satisfiable. We
study the evolution of this process, namely the distribution over formulas obtained after scanning
through the first $m$ clauses (in the random permutation's order); we use
$\mathcal{P}^{\mathrm{sat}}_{n,m}$ to denote this distribution. Clearly, for every $m$, all formulas in
$\mathcal{P}^{\mathrm{sat}}_{n,m}$ are satisfiable (as every clause is included only if the so-far
obtained formula remains satisfiable).
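
The process just described can be sketched in a few lines for very small $n$. The following is a minimal illustration only, assuming a brute-force satisfiability check over all $2^n$ assignments and a signed-integer encoding of literals (`+i` for $x_i$, `-i` for its negation); the function names are hypothetical and not part of the paper.

```python
import itertools
import random

def all_clauses(n, k):
    """All 2^k * C(n, k) clauses over x_1..x_n; literal +i is x_i, -i its negation."""
    clauses = []
    for variables in itertools.combinations(range(1, n + 1), k):
        for signs in itertools.product((1, -1), repeat=k):
            clauses.append(tuple(s * v for s, v in zip(signs, variables)))
    return clauses

def satisfies(assignment, clause):
    """assignment is a tuple of bools; assignment[i - 1] is the value of x_i."""
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)

def random_satisfiable_process(n, k, m, rng):
    """Scan the first m clauses of a random permutation, keeping a clause
    only if the formula stays satisfiable (brute force over 2^n assignments)."""
    order = all_clauses(n, k)
    rng.shuffle(order)
    formula = []
    solutions = list(itertools.product((False, True), repeat=n))
    for clause in order[:m]:
        remaining = [a for a in solutions if satisfies(a, clause)]
        if remaining:  # still satisfiable: accept the clause
            formula.append(clause)
            solutions = remaining
        # otherwise the clause is rejected and the formula is unchanged
    return formula, solutions

if __name__ == "__main__":
    formula, solutions = random_satisfiable_process(5, 3, 40, random.Random(0))
    print(len(formula), "clauses accepted,", len(solutions), "satisfying assignments")
```

Tracking the surviving set of satisfying assignments, rather than re-solving from scratch, keeps the sketch short; it is exponential in $n$ and only meant to make the definition concrete.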

Random processes with conditioning on a certain property being respected are widely studied in
the context of graph properties. This study was pioneered by Ruciński and Wormald in 1992 [25] for
graphs with a fixed degree sequence, and also by Erdős, Suen, and Winkler in 1995 for triangle-free
and bipartite graphs [11]. Since then many other graph properties were studied such as planarity
[21], $H$-freeness [23] and also the property of being intersecting in the context of hypergraphs [6].
Thus our model is a natural extension of this approach to the satisfiability setting. The main
difficulty when dealing with these restricted processes is that the edges of the random graph (and
the clauses of the random $k$-CNF formula) are no longer independent due to conditioning. Thus
the rich methods that have been developed to understand the "classical" random graph models,
$G_{n,p}$ for example, do not carry over, at least not immediately, to the restricted setting.

Quite frequently in restricted random processes, the typical size of a final graph or formula 
(after all edges/clauses have been scanned) is a fascinating subject of study. This is however not 
the case here, as it is quite easy to see that deterministically the final random formula will have
$(2^k - 1)\binom{n}{k}$ clauses and a unique satisfying assignment. Therefore, in the setting under consideration
here the process itself (i.e. a typical development of a restricted random formula and of its set of 
satisfying assignments as the number of scanned clauses m grows) is much more interesting than 
the final result, and indeed in this paper we will study the development of a random satisfiable 
formula. 

As it turns out, if $m$ is chosen so that almost all $k$-CNF formulas with $m$ clauses over $n$ variables
are satisfiable, then $\mathcal{P}^{\mathrm{sat}}_{n,m}$ is statistically close to the uniform
distribution over such formulas, since whp none of the $m$ clauses will be rejected (writing whp we
mean with probability tending to 1 as $n$ goes to infinity). Therefore if this is the case, then the
clauses are practically independent of each other, and the "usual" techniques apply. Remarkable
phenomena occurring in the uniform distribution are phase transitions. With respect to the property
of being satisfiable, such a phase transition takes place too. More precisely, there exists a
threshold $d = d(n,k)$ such that almost all $k$-CNF formulas over $n$ variables with $m$ clauses such
that $m/n > d$ are not satisfiable, and almost all formulas with $m/n < d$ are [15]. Thus, while
$\mathcal{P}^{\mathrm{sat}}_{n,m}$ is statistically close to the uniform distribution for $m/n$ below
the threshold, it is not clear what a typical $\mathcal{P}^{\mathrm{sat}}_{n,m}$ instance looks like
when crossing this threshold (which is conjectured to be roughly 4.26 for 3SAT), and whether there
exists a polynomial time algorithm for finding a satisfying assignment for such instances.

In this work we analyze $\mathcal{P}^{\mathrm{sat}}_{n,m}$ when $m/n$ is some sufficiently large
constant above the satisfiability threshold. The first part of our result is characterizing the
structure of the solution space of a typical formula in $\mathcal{P}^{\mathrm{sat}}_{n,m}$. By the
"solution space" of a formula we mean the set of all satisfying assignments (which is a subset of
all $2^n$ possible assignments). Formally,

Theorem 1. Let $F$ be a random $k$-CNF from $\mathcal{P}^{\mathrm{sat}}_{n,m}$, $m/n > c$, $c = c(k)$ a
sufficiently large constant. Then whp $F$ enjoys the following properties:

1. All but $e^{-\Omega(m/n)}n$ variables are frozen.

2. The formula induced by the non-frozen variables decomposes into connected components of at
most logarithmic size.

3. Letting $\beta(F)$ be the number of satisfying assignments of $F$, we have
$\frac{1}{n}\log\beta(F) = e^{-\Omega(m/n)}$.

By a frozen variable we mean a variable that takes the same value in all satisfying assignments.
Notice that the third item in Theorem 1 follows directly from the first. One immediate corollary of
this theorem is:

Corollary 1. Let $F$ be a random $k$-CNF from $\mathcal{P}^{\mathrm{sat}}_{n,m}$, $m/n > c\log n$,
$c = c(k)$ a sufficiently large constant. Then whp $F$ has only one satisfying assignment.

The corollary follows from the third item in Theorem 1 since $e^{-\Omega(m/n)} = o(n^{-1})$ for
$m/n > c\log n$, and therefore $\log\beta(F) = n \cdot o(n^{-1}) = o(1)$, or in turn,
$\beta(F) = 1 + o(1)$.

The characterization given by Theorem 1 is in sharp contrast with the structure of the solution
space of $\mathcal{P}^{\mathrm{sat}}_{n,m}$ formulas with $m/n$ just below the threshold. Specifically,
the conjectured picture, some supporting evidence of which was proved rigorously for $k > 8$
[2, 22, 1], is that typically random $k$-CNF formulas in the near-threshold regime have an exponential
number of clusters of satisfying assignments. While any two assignments in distinct clusters disagree
on at least $\varepsilon n$ variables, any two assignments within one cluster coincide on
$(1 - \varepsilon)n$ variables. Furthermore, each cluster has a linear number of frozen variables
(frozen w.r.t. all satisfying assignments within that cluster). This structure seems to make life
hard for most known SAT heuristics. One explanation seems to be that the algorithms do not "steer"
into one cluster but rather try to find a "compromise" between the satisfying assignments in distinct
clusters, which actually is impossible.
Complementing this picture rigorously, we show that a typical formula in
$\mathcal{P}^{\mathrm{sat}}_{n,m}$ (in the above-threshold regime) can be solved efficiently. Formally,

Theorem 2. There exists a deterministic polynomial time algorithm that whp finds a satisfying
assignment for $k$-CNF formulas from $\mathcal{P}^{\mathrm{sat}}_{n,m}$, $m/n > c$, $c = c(k)$ a
sufficiently large constant.

Our proof of Theorem 2 is constructive in the sense that we explicitly describe the algorithm.

Remark 1. Observe that in both theorems we have $m/n > c(k)$, $c$ some function of $k$. We assume
that $k$ is fixed, and therefore $c(k)$ is some constant. The true dependency is given by
$c(k) = c_0 2^k$ where $c_0$ is some moderate universal constant, say 100. The exponential dependency
on $k$ is somewhat inevitable as the satisfiability threshold itself scales exponentially with $k$
(asymptotically $2^k \ln 2$). In this work we do not go to such fine details as determining the
constant $c_0$, though this task is perhaps manageable.

Remark 2. Another natural problem to study is $k$-colorability. Similar to random $k$-CNF formulas,
the random graph $G_{n,p}$ also goes through a phase transition w.r.t. the property of being
$k$-colorable, as $np$ grows. Analogously to the random $k$-CNF process that we defined, one can
consider a restricted random graph process. Specifically, randomly order all $\binom{n}{2}$ edges of
the graph, go over them in that order and include each new edge as long as the resulting graph
remains $k$-colorable. Some of the results that we have for $k$-SAT extend to the $k$-colorability
process. A more thorough discussion is given in Section 6.

1.2 Related Work and Techniques 

Almost all polynomial-time heuristics suggested so far for random instances (either SAT or graph
optimization problems) were analyzed when the input is sampled according to a planted-solution
distribution, or various semi-random variants thereof. Alon and Kahale [3] suggest a polynomial
time algorithm based on spectral techniques that whp properly $k$-colors a random graph from the
planted $k$-coloring distribution (the distribution of graphs generated by partitioning the $n$
vertices into $k$ equally-sized color classes, and including every edge connecting two different
color classes with probability $p = p(n)$), for graphs with average degree greater than some
constant. In the SAT context, Flaxman's algorithm, drawing on ideas from [3], solves whp planted 3SAT
instances where the clause-variable ratio is greater than some constant. Also [13, 12, 18] address
the planted 3SAT distribution.

On the other hand, very little work was done on non-planted distributions, such as
$\mathcal{P}^{\mathrm{sat}}_{n,m}$. In this context one can mention a work of Chen [7] who provides an
exponential time algorithm for the uniform distribution over satisfiable $k$-CNF formulas with
exactly $m$ clauses where $m/n$ is greater than some constant. Ben-Sasson et al. [5] also study this
distribution but with $m/n = \Omega(\log n)$, a regime where the uniform distribution and the planted
distribution essentially coincide (since typically there is only one satisfying assignment), and
leave as an open question whether one can
characterize the regime m/n = o(logn). This question was resolved in ^ (and in ^ for the uniform 
distribution over /c-colorable graphs). 

While some of the ideas suggested in these works have proven to be instrumental for our setting,
most of their analytical methods break when considering $\mathcal{P}^{\mathrm{sat}}_{n,m}$. In
$\mathcal{P}^{\mathrm{sat}}_{n,m}$ not only do clauses depend on each other (unlike the planted
distribution where clauses are chosen independently), but the order in which they are introduced also
plays a role (which is not the case in the uniform distribution studied in [9], although the clauses
are not chosen independently). Therefore we had to come up with new analytical tools that might be
of interest in other settings as well.
1.3 Paper's Structure

The rest of the paper is structured as follows. In Section 2 we discuss relevant structural properties
that a typical formula in $\mathcal{P}^{\mathrm{sat}}_{n,m}$ possesses; the proofs of some properties
are postponed to Sections 4 and 5. One consequence of this discussion will be a proof of Theorem 1.
We then prove Theorem 2 in Section 3 by presenting an algorithm and showing that it meets the
requirements of Theorem 2. In Section 6 we discuss the $k$-colorability setting (mentioned in
Remark 2) more elaborately, and concluding remarks are given in Section 7.

To simplify the presentation we shall address, in what follows, only the case $k = 3$. The case of
general $k$ easily follows from the same arguments (taking $m/n > c(k)$, $c(k)$ as mentioned in
Remark 1).

2 Properties of a Random $\mathcal{P}^{\mathrm{sat}}_{n,m}$ Instance

This section contains the technical part of the paper. In it we analyze the structure of a typical
formula in $\mathcal{P}^{\mathrm{sat}}_{n,m}$. Here and throughout we think of $m$ as $cn$, $c$ at
least some sufficiently large constant.

2.1 Preliminaries and Techniques 

When analyzing some structural properties of a random instance in $\mathcal{P}^{\mathrm{sat}}_{n,m}$
it will be more convenient to analyze the same property under a somewhat different distribution, and
then to go back to $\mathcal{P}^{\mathrm{sat}}_{n,m}$ (possibly paying some factor in the estimate).

The variation we consider is $\mathcal{P}^{\mathrm{sat}}_{n,p}$ and is defined as follows: permute at
random all possible $M = 8\binom{n}{3}$ clauses, go over the clauses in the permutation's order and
include each clause with probability $p = m/M$ if also its addition leaves the instance satisfiable.
Let $\mathcal{P}_{n,p}$ be defined similarly, just without the conditioning (i.e., all clauses chosen
at random are included in the formula, thus making it not necessarily satisfiable).
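
The two distributions just defined can be sampled together from the same coin tosses, which is how $F$ and $F^*$ are related later in this section. The sketch below is only an illustration for $k = 3$ and tiny $n$, assuming a brute-force satisfiability check and a signed-integer literal encoding; `sat_process_p` is a hypothetical name.

```python
import itertools
import random

def sat_process_p(n, p, rng):
    """Return (F, F*) for k = 3: F collects every clause whose coin toss
    succeeded, F* only those whose addition keeps the formula satisfiable."""
    clauses = [tuple(s * v for s, v in zip(signs, variables))
               for variables in itertools.combinations(range(1, n + 1), 3)
               for signs in itertools.product((1, -1), repeat=3)]
    rng.shuffle(clauses)
    solutions = list(itertools.product((False, True), repeat=n))
    chosen, accepted = [], []
    for clause in clauses:
        if rng.random() >= p:  # coin toss failed: clause is not chosen
            continue
        chosen.append(clause)  # chosen clauses form F (no conditioning)
        remaining = [a for a in solutions
                     if any(a[abs(lit) - 1] == (lit > 0) for lit in clause)]
        if remaining:  # accepted clauses form F*
            accepted.append(clause)
            solutions = remaining
    return chosen, accepted

if __name__ == "__main__":
    F, F_star = sat_process_p(6, 0.3, random.Random(1))
    print(len(F), "clauses chosen,", len(F_star), "of them accepted")
```

By construction every clause of `accepted` also appears in `chosen`, so $F^*$ is a subformula of $F$.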

Lemma 1. $\mathcal{P}^{\mathrm{sat}}_{n,m} = \mathcal{P}^{\mathrm{sat}}_{n,p} \mid \{\text{exactly } m \text{ clauses were chosen}\}$.

Proof. To generate $F$ in $\mathcal{P}^{\mathrm{sat}}_{n,m}$ one first picks a random permutation of
the clauses and then scans one by one the first $m$ clauses, skipping clauses whose addition will make
the instance unsatisfiable. The key point is to notice that any ordered $m$-tuple of clauses is
equally likely to be chosen as the first $m$ clauses. This is exactly the case in
$\mathcal{P}^{\mathrm{sat}}_{n,p}$ when conditioning on the fact that exactly $m$ clauses were chosen:
any set of $m$ clauses is equally likely, and also any permutation of them. ■

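Lemma 2. Set $M = 8\binom{n}{3}$. For any property $A$, if $p = m/M$ then
$\Pr^{\mathcal{P}^{\mathrm{sat}}_{n,m}}[A] \le O(\sqrt{m}) \cdot \Pr^{\mathcal{P}^{\mathrm{sat}}_{n,p}}[A]$.

Proof. Let $X$ be a random variable counting the number of clauses whose coin toss was successful.
$X$ is distributed $\mathrm{Binom}(8\binom{n}{3}, p)$, and therefore $E[X] = m$. Standard calculations
show that $\Pr[X = m] = \Omega(m^{-1/2})$. By Lemma 1,

$$\Pr^{\mathcal{P}^{\mathrm{sat}}_{n,m}}[A] = \Pr^{\mathcal{P}^{\mathrm{sat}}_{n,p}}[A \mid X = m] = \frac{\Pr^{\mathcal{P}^{\mathrm{sat}}_{n,p}}[A \wedge X = m]}{\Pr^{\mathcal{P}^{\mathrm{sat}}_{n,p}}[X = m]} \le O(\sqrt{m}) \cdot \Pr^{\mathcal{P}^{\mathrm{sat}}_{n,p}}[A].$$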
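
The "standard calculation" behind $\Pr[X = m] = \Omega(m^{-1/2})$ can be checked numerically. The sketch below is an illustration only: it evaluates the binomial point probability at its mean in log-space and rescales by $\sqrt{m}$; the choice $m = cn$ with an integer `c` and the function names are assumptions made for the example.

```python
import math

def binom_pmf(N, p, x):
    """Pr[Binom(N, p) = x], computed in log-space to avoid overflow."""
    log_pmf = (math.lgamma(N + 1) - math.lgamma(x + 1) - math.lgamma(N - x + 1)
               + x * math.log(p) + (N - x) * math.log1p(-p))
    return math.exp(log_pmf)

def scaled_point_mass(n, c):
    """sqrt(m) * Pr[X = m] for X ~ Binom(M, m/M), M = 8*C(n,3), m = c*n."""
    M = 8 * math.comb(n, 3)
    m = c * n
    return math.sqrt(m) * binom_pmf(M, m / M, m)

if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, scaled_point_mass(n, 5))
```

If the rescaled value stays bounded away from zero as $n$ grows, that is consistent with the $\Omega(m^{-1/2})$ bound used in the proof.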
Remark 3. In the remainder of the section we analyze $\mathcal{P}^{\mathrm{sat}}_{n,p}$ instead of
$\mathcal{P}^{\mathrm{sat}}_{n,m}$. When we use the expression "with high probability" (abbreviated
whp) we will always mean with probability $1 - o(m^{-1/2})$. Lemma 2 will then imply that we can
switch back to $\mathcal{P}^{\mathrm{sat}}_{n,m}$ and still the property holds
with probability 1 — o(l). We will actually prove that all the properties hold with probability 
1 — o{n~^) which is always at least 1 — o{m~^''^) since m = 0{n^). 

2.2 The Discrepancy Property 

A well known result in the theory of random graphs is that a random graph whp will not contain a 
small yet unexpectedly dense subgraph. This is also the case for $\mathcal{P}_{n,p}$ (when considering the graph
induced by the formula). In general, discrepancy properties play a fundamental role in the proof 
of many important structural properties such as expansion, the spectra of the adjacency matrix, 
etc., and indeed in our case the discrepancy property plays a major role both in the algorithmic 
perspective and in the analysis of the clustering phenomenon. The following discussion rigorously 
establishes the above stated fact. 

Definition 1. We say that a 3CNF formula $F$ on $n$ variables is $\rho$-proportional if there exists
no set $U$ of variables such that:

- \U\ < n/10^ 

- There are at least $\rho \cdot |U|$ clauses in $F$ each containing at least two variables from $U$.

(We say that a clause $C$ contains a variable $x$ if $x$ appears in $C$ either as $x$ or as $\bar{x}$;
in this context we do not differentiate between the two cases.)
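
For small formulas, Definition 1 can be checked directly by enumerating candidate sets $U$. The following is a brute-force sketch only: the size bound on $U$ is passed in as a hypothetical parameter `max_size` rather than fixed, literals are signed integers, and the function name is an assumption for illustration.

```python
import itertools

def is_proportional(formula, n, rho, max_size):
    """Brute-force check: no set U with |U| <= max_size has at least
    rho * |U| clauses containing two or more variables of U."""
    for size in range(1, max_size + 1):
        for U in itertools.combinations(range(1, n + 1), size):
            members = set(U)
            dense = sum(1 for clause in formula
                        if sum(abs(lit) in members for lit in clause) >= 2)
            if dense >= rho * size:
                return False
    return True

if __name__ == "__main__":
    formula = [(1, 2, 3), (1, 2, -4), (-1, -2, 5)]
    # U = {1, 2} meets three clauses in two variables each.
    print(is_proportional(formula, 5, 1.5, 2))
    print(is_proportional(formula, 5, 2.0, 2))
```

The enumeration is exponential in `max_size`, so this only serves to make the quantifiers of the definition explicit.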

Proposition 1. Let $F$ be distributed according to $\mathcal{P}_{n,p}$ with $n^2p > d$, $d$ a
sufficiently large constant, and set $\rho = n^2p/5500$. Then whp $F$ is $\rho$-proportional.

Remark 4. To see how Proposition 1 corresponds to the random graph context, consider the graph
induced by the formula F (the vertices are the variables, and two variables are connected by an 
edge if there exists some clause containing them both) and observe that every clause that contains 
at least two variables from U contributes an edge to the subgraph induced by U. Thus if we have 
many such clauses, this subgraph will be prohibitively dense. Since F is random so is its induced 
graph, and therefore the latter will typically not occur. 

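Proof. The probability that a random formula $F$ in $\mathcal{P}_{n,p}$ contains a set $U$ of
variables of size $u$ that violates proportionality is at most (using the union bound, with $u$
ranging over the set sizes allowed by Definition 1):

$$\sum_{u} \binom{n}{u} \binom{8n\binom{u}{2}}{u n^2 p/5500} \, p^{\,u n^2 p/5500},$$

and this sum is small enough to give the whp guarantee of Remark 3.
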

The first term accounts for the possible ways of choosing the variables of $U$, the second is to
choose the $un^2p/5500$ clauses that contain at least two variables from $U$ (out of at most
$8n\binom{u}{2}$ possible ones), and the last term is just the probability of the chosen clauses to
actually appear in $F$. To bound this sum we use the upper bound on $u$ from Definition 1, the fact
that $n^2p$ can be arbitrarily large (constant), and the following standard estimate for the binomial
coefficient:

$$\binom{n}{x} \le \left(\frac{en}{x}\right)^{x}.$$
Corollary 2. Let $F^*$ be distributed according to $\mathcal{P}^{\mathrm{sat}}_{n,p}$ with
$n^2p > d$, $d$ a sufficiently large constant. Then whp $F^*$ is $(n^2p/5500)$-proportional.

The corollary follows easily by observing that the proportionality property is monotonically
decreasing: it is preserved when clauses are removed, and $F^*$ is a subformula of a formula
distributed according to $\mathcal{P}_{n,p}$.

2.3 Crude Characterization of the Solution Space's Structure 

In this section we make the first step towards proving Theorem 1 (clustering). We give a rather
crude characterization of the structure of the solution space of a typical instance in
$\mathcal{P}^{\mathrm{sat}}_{n,p}$. This characterization will be refined in the sequel.

Definition 2. A 3CNF $F$ is called $r$-concentrated if every two satisfying assignments
$\psi_1, \psi_2$ of $F$ are at Hamming distance at most $r$ from each other.
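
Definition 2 can be tested exhaustively on toy instances. The sketch below is an illustration under the assumption that $n$ is small enough to enumerate all $2^n$ assignments; the helper names are hypothetical, and literals are encoded as signed integers.

```python
import itertools

def satisfying_assignments(formula, n):
    """All assignments (tuples of bools, index i - 1 for x_i) satisfying the formula."""
    return [a for a in itertools.product((False, True), repeat=n)
            if all(any(a[abs(lit) - 1] == (lit > 0) for lit in clause)
                   for clause in formula)]

def is_concentrated(formula, n, r):
    """True if every two satisfying assignments are at Hamming distance <= r."""
    sols = satisfying_assignments(formula, n)
    return all(sum(x != y for x, y in zip(a, b)) <= r
               for a, b in itertools.combinations(sols, 2))

if __name__ == "__main__":
    # The four clauses below force x_1 = TRUE and leave x_2, x_3 free.
    formula = [(1, 2, 3), (1, 2, -3), (1, -2, 3), (1, -2, -3)]
    print(len(satisfying_assignments(formula, 3)), is_concentrated(formula, 3, 2))
```

In the usage example the variable $x_1$ takes the same value in every satisfying assignment, so any two solutions differ on at most the two remaining variables.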

Proposition 2. Let $F^*$ be distributed according to $\mathcal{P}^{\mathrm{sat}}_{n,p}$ with
$n^2p > d$, $d$ a sufficiently large constant, and let $\rho = 30/(n^2p)$. Then whp $F^*$ is
$\rho n$-concentrated.

An immediate corollary of this proposition is that typically all satisfying assignments of a
$\mathcal{P}^{\mathrm{sat}}_{n,p}$ instance can be enclosed in a ball of radius $30/(np)$ in
$\{0,1\}^n$. This gives a "first-order" characterization of the structure of the solution space.

Proof. Fix two assignments $\varphi$ and $\psi$ at distance $\alpha n$, and let us bound
$\Pr[\varphi \text{ and } \psi \text{ satisfy } F^*]$. Assume w.l.o.g. that $\varphi$ is the all-TRUE
assignment. We shall now upper bound the probability of a set of clauses in $\mathcal{P}_{n,p}$ that
may result in an instance $F^*$ that is satisfied by both assignments. In particular a clause of the
form $C_1 = (x \vee \bar{y} \vee \bar{z})$, where $x$ is a variable on which $\varphi$ and $\psi$
disagree, and $y, z$ are variables on which both agree, cannot be chosen to $\mathcal{P}_{n,p}$ (such
a clause is satisfied by $\varphi$ but not by $\psi$). Let us call such a clause a type 1 clause. If a
type 1 clause appears, then either it is included in $F^*$, and then $\psi$ cannot be a satisfying
assignment, or it is rejected and then $\varphi$ is already at this point not a satisfying
assignment. The same applies for clauses of the form $C_2 = (s \vee w \vee t)$, where on all three
variables $s, w, t$ the two assignments disagree; call them type 2. It remains to upper bound the
probability of a $\mathcal{P}_{n,p}$ instance that does not contain type 1 and type 2 clauses. There
are $\alpha n\binom{(1-\alpha)n}{2}$ type 1 clauses and $\binom{\alpha n}{3}$ type 2 clauses. The
probability of none being chosen is

$$(1-p)^{\alpha n\binom{(1-\alpha)n}{2} + \binom{\alpha n}{3}} \le \exp\left\{-p \cdot \left(\alpha n\binom{(1-\alpha)n}{2} + \binom{\alpha n}{3}\right)\right\}. \qquad (1)$$

If $30/(n^2p) \le \alpha \le 1/2$ then

$$p \cdot \alpha n\binom{(1-\alpha)n}{2} \ge p\alpha n \cdot n^2/8 \ge 3n.$$

If $\alpha > 1/2$ then

$$p \cdot \binom{\alpha n}{3} \ge n \cdot n^2p/48 \ge 3n.$$

In the last inequality we use the fact that $n^2p$ can be arbitrarily large (specifically, greater
than 144). In any case, the expression in (1) is at most $5^{-n}$. Since we have no more than $4^n$
ways of choosing the pair $\varphi, \psi$, we deduce using the union bound that whp no such "bad" pair
exists. ■
2.4 The Core Variables

We describe a subset of the variables, referred to as the core variables, which plays a crucial role in 
the understanding of $\mathcal{P}^{\mathrm{sat}}_{n,p}$. A variable is said to be frozen in $F$ if in every satisfying assignment it
takes the same value. The notion of a core captures this phenomenon. In addition, a core typically 
contains all but a small (though constant) fraction of the variables. This implies that a large fraction 
of the variables is frozen, a fact which must leave imprints on various structural properties of the 
formula. These imprints allow efficient heuristics to recover a satisfying assignment of the core. A 
second implication of this is an upper bound on the number of possible satisfying assignments, and 
on the distance between every such two. Thus the notion of a core plays a key role in obtaining a 
characterization of the cluster structure of the solution space. 

Let us now proceed with a rigorous definition of a core. Before doing so, we take a long detour 
on expanding sets. 

Definition 3. (support) Given a 3CNF formula $F$ and some assignment $\psi$ to the variables, we
say that a variable $x$ supports a clause $C$ (in which it appears) w.r.t. $\psi$ if $x$ is the only
variable whose literal evaluates to true in $C$ under $\psi$.

Definition 4. (expanding set) Given a 3CNF formula $F$ and an assignment $\psi$ to the variables (not
necessarily satisfying), a set of variables $Z$ is called $t$-expanding in $F$ w.r.t. $\psi$ if every
variable $x \in Z$ supports at least $t$ clauses in $F[Z]$ w.r.t. $\psi$.

$F[Z]$ stands for the subformula of $F$ containing the clauses where all three variables belong to $Z$.
The following proposition illustrates the usefulness of Definition 4.

Proposition 3. Let $F$ be a 3CNF formula on $n$ variables and let $Z$ be a $t$-expanding set w.r.t.
some assignment $\psi$. If in addition:

- $\psi$ satisfies $F$,

- $F$ is $r$-concentrated (Definition 2), where $r$ is the upper bound on $|U|$ in Definition 1,

- $F$ is $t$-proportional (Definition 1),

then the variables in $Z$ are frozen in $F$.

Proof. By contradiction, let $\psi$ be the satisfying assignment w.r.t. which $Z$ is defined and let
$\psi'$ be some satisfying assignment of $F$ such that there exists a non-empty set $U \subseteq Z$ of
variables for which $\forall x \in U$, $\psi(x) \ne \psi'(x)$ (if for every $\psi'$ it holds that
$U = \emptyset$ then we are done). Take $x \in U$ and consider all the clauses that $x$ supports
w.r.t. $\psi$ in $F[Z]$. It must be that every such clause contains at least another variable $y$ on
which $\psi$ and $\psi'$ disagree (since every such clause is satisfied by $\psi'$ but the literal
corresponding to $x$ is false under $\psi'$). Therefore $y$ belongs to $U$ by definition. We conclude
that there exists a set $U$ of variables and $t \cdot |U|$ clauses each containing at least two
variables from $U$ (no clause was counted twice since the supporter of a clause is unique by
definition). Further, $F$ is concentrated by assumption, and therefore $|U|$ obeys the size bound of
Definition 1. Combining the latter two facts we derive a contradiction to the $t$-proportionality
of $F$. ■
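
Definitions 3 and 4 translate directly into code. The following is a minimal sketch, assuming literals are signed integers and an assignment is a dictionary from variable index to a Boolean; `supporter` and `is_expanding` are hypothetical names introduced for illustration.

```python
def supporter(clause, assignment):
    """Return the variable supporting the clause under the assignment,
    or None if zero or more than one literal evaluates to TRUE."""
    true_lits = [lit for lit in clause if assignment[abs(lit)] == (lit > 0)]
    return abs(true_lits[0]) if len(true_lits) == 1 else None

def is_expanding(formula, Z, assignment, t):
    """True if every variable of Z supports at least t clauses of F[Z]."""
    Z = set(Z)
    support_count = {x: 0 for x in Z}
    for clause in formula:
        if all(abs(lit) in Z for lit in clause):  # clause belongs to F[Z]
            x = supporter(clause, assignment)
            if x is not None:
                support_count[x] += 1
    return all(count >= t for count in support_count.values())

if __name__ == "__main__":
    all_true = {1: True, 2: True, 3: True, 4: True}
    formula = [(1, -2, -3), (1, -2, -4), (-1, 2, -3), (-1, -2, 3)]
    print(is_expanding(formula, {1, 2, 3}, all_true, 1))
```

In the usage example each of $x_1, x_2, x_3$ supports exactly one clause whose variables all lie in $Z = \{x_1, x_2, x_3\}$, so $Z$ is 1-expanding but not 2-expanding.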

Proposition 4. Let $F$ be distributed according to $\mathcal{P}_{n,p}$ with $n^2p > d$, $d$ a
sufficiently large constant. Then whp there exists an integer $t = t(n,p) > 0$, a set $Z$ of
variables, and an assignment $\psi$ such that:

- $Z$ is $t$-expanding w.r.t. $\psi$,

- $|Z| = (1 - e^{-\Omega(n^2p)})n$,

- $F$ is $t/10$-proportional,

- $\psi$ satisfies $F^*$,

- $F^*$ is $r$-concentrated, with $r$ as in Proposition 3.

The complete proof of this proposition is deferred to Section 4.

Corollary 3. The set $Z$ promised in Proposition 4 is frozen in $F^*$, and furthermore $Z$ is
$t$-expanding w.r.t. every satisfying assignment of $F^*$.

Proof. To see why $Z$ is frozen, let $S$ be the set of clauses in $F$ that are supported w.r.t.
$\psi$. First observe that $F^*$ is $t$-proportional as it is a subformula of $F$ (and $F$ is
$t/10$-proportional and therefore also $t$-proportional). Furthermore $S$ is contained in $F^*$. This
is because $\psi$ is a satisfying assignment of $F^*$ throughout the entire generating process, thus
every clause in $S$ that arrives is not rejected. Therefore $Z$ is also $t$-expanding in $F^*$ w.r.t.
$\psi$. Finally apply Proposition 3 to $F^*$. The second part of the corollary is immediate from the
fact that $Z$ is frozen. ■

Definition 5. (self-contained sets) Given a 3CNF formula $F$ we say that a set of variables $Z$ is
$r$-self-contained in $F$ if every variable $x \in Z$ appears in at most $r$ clauses in
$F \setminus F[Z]$.

Finally, we are ready to define a core. 

Definition 6. (core) A set of variables $\mathcal{H}$ is called a $t$-core of $F$ w.r.t. an assignment
$\psi$ if $\mathcal{H}$ is $t$-expanding in $F$ w.r.t. $\psi$ and also $(t/3)$-self-contained in $F$.

The property of being self-contained is necessary for the algorithmic part (the proof of Theorem 2,
at least as our analysis proceeds).

Proposition 5. Let $F^*$ be distributed according to $\mathcal{P}^{\mathrm{sat}}_{n,p}$ with
$n^2p > d$, $d$ a sufficiently large constant. Then whp there exists an integer $t = t(n,p) > 0$, a
satisfying assignment $\psi$ of $F^*$, and a $t$-core $\mathcal{H}$ w.r.t. $\psi$ such that:

- $|\mathcal{H}| = (1 - e^{-\Omega(n^2p)})n$,

- $\mathcal{H}$ is frozen in $F^*$,

- $F^*$ is $t/10$-proportional.

The proof of this proposition is best understood in the context of the proof of Proposition 4.
Therefore the proof appears in Section 4.1.

Remark 5. Observe that if there exist two $t$-cores $\mathcal{H}_1$ and $\mathcal{H}_2$ that satisfy
the conditions of Proposition 5, then also their union $\mathcal{H}_1 \cup \mathcal{H}_2$ is a
$t$-core (since the core variables are frozen). Therefore we may speak of a unique maximal $t$-core.
From now on, when we refer to a $t$-core, we mean the maximal one. Note that this maximal core is also
frozen by Proposition 3. Therefore it can serve as a $t$-core for any satisfying assignment of $F$
and thus is effectively uniquely defined by the formula.

2.5 Satellite Variables 

In this section we isolate another set of variables which we call satellite variables. As it turns
out, to prove Theorems 1 and 2 it is enough to distinguish between core and satellite variables and
all other variables in $V$. Let us start with a formal definition of a satellite variable.
Definition 7. Given a formula $F$ with a core set $\mathcal{H}$ w.r.t. an assignment $\psi$, a
variable $x$ is called a 0-satellite with respect to $\mathcal{H}$ if $x \in \mathcal{H}$. A variable
$x$ is called an $i$-satellite if $F$ contains a clause of the form
$(x \vee \ell_{z_1} \vee \ell_{z_2})$ or $(\bar{x} \vee \ell_{z_3} \vee \ell_{z_4})$ where for every
$j = 1, 2, 3, 4$, $z_j$ is a $b$-satellite for $b < i$, and $\psi(\ell_{z_j}) = \text{FALSE}$;
moreover at least one of the $z_j$ is an $(i-1)$-satellite. We say that $x$ is a satellite variable
if it is a $b$-satellite for some number $b \ge 1$.

In this definition, $\ell_z$ stands for a literal corresponding to a variable $z$ (i.e. $\ell = z$ or
$\ell = \bar{z}$). Observe that if $\mathcal{H}$ is frozen in $F$ then $\mathcal{H} \cup S$ is frozen
as well, where $S$ denotes the set of satellite variables (this follows from a simple inductive
argument).

Before we formally state the property involving the satellite variables we introduce some additional
notation. The connected components of a formula $F$ are the sub-formulas $F[C_1], \ldots, F[C_k]$,
where $C_1, C_2, \ldots, C_k$ are the connected components in the graph $G_F$ induced by $F$ (the
vertices of $G_F$ are the variables, and two variables are connected by an edge if there exists some
clause containing them both). Given a set of variables $A$ and an assignment $\varphi$ we denote by
$F_{out}(A, \varphi)$ the subformula of $F$ which is the outcome of the following procedure: set the
variables in $A$ according to $\varphi$ and simplify $F$ (by simplify we mean remove every clause that
contains a TRUE literal, and remove FALSE literals from the other clauses).
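
The simplification procedure and the component structure of the induced graph can be sketched as follows. This is an illustration only, assuming signed-integer literals, a dictionary assignment that needs to cover just the variables of $A$, and a small union-find for the components; the function names are hypothetical.

```python
def simplify(formula, A, assignment):
    """F_out(A, phi): fix the variables of A, drop clauses containing a TRUE
    literal, and delete the (FALSE) literals over A from the remaining clauses."""
    out = []
    for clause in formula:
        if any(abs(lit) in A and assignment[abs(lit)] == (lit > 0) for lit in clause):
            continue  # clause contains a TRUE literal
        out.append(tuple(lit for lit in clause if abs(lit) not in A))
    return out

def component_sizes(formula):
    """Sizes of the connected components of the variable graph, largest first."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for clause in formula:
        variables = [abs(lit) for lit in clause]
        for v in variables:
            find(v)  # register the variable
        for v in variables[1:]:
            parent[find(v)] = find(variables[0])  # merge into one component
    sizes = {}
    for v in parent:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)

if __name__ == "__main__":
    formula = [(1, 2, 3), (-1, 4, 5), (4, -5, 6), (7, 8, 9)]
    residual = simplify(formula, {1}, {1: True})
    print(residual, component_sizes(residual))
```

Variables that no longer occur in any clause after simplification are not counted in any component, which matches reading the components off the residual formula.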

Proposition 6. Let $F^*$ be distributed according to $\mathcal{P}^{\mathrm{sat}}_{n,p}$ with
$n^2p > d$, $d$ a sufficiently large constant. Then whp there exists an integer $t = t(n,p) > 0$, a
satisfying assignment $\psi$ of $F^*$, and a $t$-core $\mathcal{H}$ w.r.t. $\psi$ such that:

- $|\mathcal{H}| \ge (1 - e^{-\Omega(n^2p)})n$.

- $F^*$ is $t/10$-proportional.

- Letting $S$ be its satellite variables, $\mathcal{H} \cup S$ are frozen in $F^*$.

- The largest connected component in $F^*_{out}(\mathcal{H} \cup S, \psi)$ is of size at most
$\log n$.

The new addition compared with Proposition 5 is the fact that we characterize the structure of
the formula induced by the variables not in $\mathcal{H} \cup S$.

Our proof strategy is the following. Expose the first part of the random formula $F$ and consider
a $t$-core $\mathcal{H}$ promised whp by Proposition 5. We look at a "large" connected component
outside the core (if none exists then we are done) and consider the following "shattering" procedure.
Expose the second part of the random formula, and suppose for the time being that the core does not
change (even if new clauses are included in $F^*$). Let $x$ be a non-core variable after the first
part, which lies in a spanning tree of a large connected component. The key observation is that when
resuming the random clause process, $x$ becomes a satellite variable with high (constant)
probability, in which case the spanning tree splits into parts. Since the tree is large, it contains
many variables $x$, and therefore with very high probability at least one of them will become a
satellite variable and shatter the tree. Finally, it remains to upper bound the number of possible
large trees vs. the probability that such a tree does not survive. The complete proof is given in
Section 5.

One problem with the approach we just described is that we assumed that the core $\mathcal{H}$
established after the first round does not change when resuming the random clause process. This is
not necessarily the case, as for example some core variables may violate the self-containment
property and be removed, and this may cause a chain reaction of other variables leaving the core
(maybe their support is too small, or they violate the self-containment requirement). However, whp
all the variables that are removed from the core when resuming the random clause process remain
satellite variables, and furthermore there are very few such variables.

Remark 6. In several papers which studied planted-solution distributions, for example [3, 14], a
similar notion of a core appears (without the notion of satellite variables), and an analysis of the
structure of the instance ($k$-colorable graph or $k$-CNF formula) induced on the non-core variables
is also given. The main difference from our setting is the fact that the planted distribution is
a product space, and therefore it was possible to prove that the core variables are distributed
similarly to a uniformly random set of variables. In our case establishing such a property is a more
challenging task. As it turns out, the approach that we take, defining the satellite variables,
simplifies considerably the proof of this property.

2.6 The Majority Vote 

Given a 3CNF formula F and a variable x we let $N^+(x)$ be the set of clauses in F in which x
appears positively (namely, as the literal $x$), and $N^-(x)$ be the set of clauses in which x appears
negatively (that is, as $\bar{x}$). The Majority Vote assignment over F, which we denote by MAJ, assigns
every x according to the sign of $|N^+(x)| - |N^-(x)|$ (TRUE if the difference is positive and FALSE
otherwise).
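The definition above translates directly into code. The following is a minimal sketch, assuming clauses are given as tuples of nonzero integers (a positive integer v for the literal x_v, a negative one for its negation); this encoding and the function name `majority_vote` are illustrative choices, not part of the paper.

```python
from collections import Counter

def majority_vote(clauses, variables):
    """Return MAJ: x is TRUE iff |N^+(x)| - |N^-(x)| is strictly positive."""
    balance = Counter()
    for clause in clauses:
        for lit in clause:
            # +1 for a positive occurrence, -1 for a negative one
            balance[abs(lit)] += 1 if lit > 0 else -1
    # Ties (difference zero) are resolved to FALSE, as in the definition.
    return {x: balance[x] > 0 for x in variables}

clauses = [(1, 2, -3), (1, -2, 3), (-1, -2, -3), (1, 2, 3)]
maj = majority_vote(clauses, [1, 2, 3])
```

In this toy formula variable 1 has three positive and one negative occurrence, while variables 2 and 3 are balanced and therefore default to FALSE.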

Proposition 7. Let F* be distributed according to $\mathcal{P}^{sat}_{n,p}$ with $n^2p \ge d$, d a sufficiently large constant.
Then whp every satisfying assignment of F* differs from MAJ on at most $e^{-\Theta(n^2p)}n$ variables.

Proof. Consider the following two-step procedure to generate F: in the first step go over the
$M = 8\binom{n}{3}$ clauses and toss a coin with success probability $p_1$ for each. We take the clauses that were chosen
and put them first, ordered at random. Call $F_1$ this first part (and respectively define $F_1^*$ in our
standard way, i.e., by scanning sequentially the clauses of $F_1$ and including those whose addition
leaves the formula satisfiable). Observe that $F_1$ is distributed according to $\mathcal{P}_{n,p_1}$. Then in the
second round, every clause that was not chosen in the first round is included with probability $p_2$,
and the chosen clauses are ordered at random and then concatenated after $F_1$. Call $F_2$ this last
part. At the end of this subsection we prove that $F = F_1 \cup F_2$ is distributed according to $\mathcal{P}_{n,p}$ when
$p = p_1 + (1 - p_1)p_2$. Therefore we may think of F as generated in two steps (with a suitable choice
of $p_1, p_2$). We will use this technique to prove several other properties as well.
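The two-round exposure can be sketched as follows. This is an illustrative sketch only: the function names and the use of Python's `random.Random` are our assumptions, and the items stand in for the M clauses. Given p and p_1, the second-round probability is obtained by solving p = p_1 + (1 - p_1)p_2.

```python
import random

def second_round_probability(p, p1):
    """Solve p = p1 + (1 - p1) * p2 for p2."""
    return (p - p1) / (1.0 - p1)

def two_round_sample(items, p, p1, rng):
    """Return the ordered list F_1 followed by F_2, and the length of F_1."""
    p2 = second_round_probability(p, p1)
    first, second = [], []
    for item in items:
        if rng.random() < p1:        # chosen in the first round
            first.append(item)
        elif rng.random() < p2:      # not chosen before, chosen in the second round
            second.append(item)
    rng.shuffle(first)               # each part is ordered at random
    rng.shuffle(second)
    return first + second, len(first)

p = 0.3
p1 = p / 200
p2 = second_round_probability(p, p1)
ordered, k = two_round_sample(range(1000), p, p1, random.Random(0))
```

With p_1 = p/200, as used in the proof, the computed p_2 lies between 199p/200 and p.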

Let $d_0$ be the constant promised in Proposition [5] and choose $d \ge 200d_0$. Set $p_1 = p/200$. By the
choice of $d_0$ and Proposition [5], whp all but $e^{-\Theta(n^2p)}n$ variables are frozen in $F_1^*$, and w.l.o.g. assume
that they all take the value TRUE. Further observe that whp at this point all but $e^{-\Theta(n^2p)}n$ variables
appear in no more than, say, $n^2p/30$ clauses (in $F_1$ distributed according to $\mathcal{P}_{n,p_1}$, and therefore also
in $F_1^*$). This is because every variable x is expected to appear in $F_1$ in $p_1 \cdot 8\binom{n}{2} < 4n^2p_1 = n^2p/50$
clauses. These appearances are independent (binomially distributed), therefore one can apply the
Chernoff bound, for example, to bound the probability that x appears in more than $n^2p/30$ clauses
by $e^{-\Theta(n^2p)}$. This in turn gives that the expected number of such variables is $e^{-\Theta(n^2p)}n$.
To obtain concentration around this value, consider an ordering on the M clauses and let $X_i$ be an
indicator random variable which is 1 iff clause i appeared in the first round. Let $f(X_1, X_2, \ldots, X_M)$
be a function which counts the number of variables that appear in more than $n^2p/30$ clauses in $F_1$. As
claimed, $E[f] = e^{-\Theta(n^2p)}n$, and f satisfies the Lipschitz condition with difference 3: for every i and
every two assignments $a = (a_1, \ldots, a_{i-1}, a_i, a_{i+1}, \ldots, a_M)$ and $a' = (a_1, \ldots, a_{i-1}, a_i', a_{i+1}, \ldots, a_M)$
of values to $X_1, \ldots, X_M$ (that possibly differ on the i-th coordinate), it holds that $|f(a) - f(a')| \le 3$
(every clause contains three variables). Using the method of bounded differences (e.g., Theorem
7.4.3 of [1]) it follows that f is concentrated around its expected value.

Let Z then be the set of frozen variables that appear in at most $n^2p/30$ clauses of $F_1$. Recall
that we have assumed w.l.o.g. that they all froze to TRUE. By the above discussion together with
Proposition [5], whp

$$|Z| \ge (1 - e^{-\Theta(n^2p)})n - e^{-\Theta(n^2p)}n \ge 0.999n.$$
Now let us consider the second iteration of coin flips. Fix $x \in Z$, and observe that every clause containing
x positively, if chosen in the second round, will be included in F*. There are at least $4\binom{|Z|-1}{2} - n^2p/30$
such clauses with the other two variables from Z; call them "good" clauses. As for clauses where
x appears negatively, and the other two variables are in Z, there are at most $3\binom{|Z|-1}{2}$ clauses
such that if chosen will be included (since one way of negating the variables in Z results in a FALSE
clause on frozen variables); call them "bad" clauses. In addition there are at most $8(n - |Z|)n$
clauses, containing x and at least one variable outside Z, that we don't say anything about, but let
us adversarially assume that x appears in all of them negatively, and that if chosen they are included in F*
(they are also part of the bad clauses).

In expectation, at least $p_2 \cdot (4\binom{|Z|-1}{2} - n^2p/30) \ge 1.8n^2p$ good clauses containing x will be chosen in
the second round, and at most $p_2 \cdot (3\binom{|Z|-1}{2} + 8(n - |Z|)n) \le 1.6n^2p$ bad clauses. (Recall that
$199p/200 \le p_2 \le p$.)
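The two constants 1.8 and 1.6 can be sanity-checked numerically. The sketch below evaluates both expectations for one concrete choice of n and p (our choice, purely for illustration), using the smallest admissible |Z| = 0.999n, the lower bound 199p/200 on p_2 for the good clauses, and the upper bound p for the bad ones.

```python
from math import comb

def expected_good_and_bad(n, p, z_size):
    """Lower bound on expected good clauses, upper bound on expected bad clauses."""
    p2_low, p2_high = 199 * p / 200, p
    pairs = comb(z_size - 1, 2)              # choices of the other two variables in Z
    good = p2_low * (4 * pairs - n * n * p / 30)
    bad = p2_high * (3 * pairs + 8 * (n - z_size) * n)
    return good, bad

n, p = 10_000, 1e-5                          # illustrative values; here n^2 p = 1000
z_size = n - n // 1000                       # |Z| = 0.999 n
good, bad = expected_good_and_bad(n, p, z_size)
scale = n * n * p
gap = (1.8 - 1.6 - 1 / 30) / 2               # deviation needed for MAJ to err, in units of n^2 p
```

For these values the good count is roughly 1.99 n^2 p and the bad count roughly 1.50 n^2 p, so both inequalities hold with room to spare.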

Suppose that in the $n^2p/30$ clauses from the first round x also appears negatively. To conclude,
for the majority vote of x to be wrong it must have been the case that the number of good clauses
containing x or the number of bad clauses containing x deviates by at least $(1.8 - 1.6 - 1/30)n^2p/2$
from its expectation. But since both are binomially distributed with expectation $O(n^2p)$, this
happens with probability $e^{-\Theta(n^2p)}$. Using the linearity of expectation, all but $e^{-\Theta(n^2p)}n$ of the
variables in Z are expected to have a "proper" gap. To obtain concentration around this value we
use again the method of bounded differences, similarly to what was used earlier in the proof.
Finally observe that $|Z| \ge (1 - e^{-\Theta(n^2p)})n$, and therefore $|Z| - e^{-\Theta(n^2p)}n = (1 - e^{-\Theta(n^2p)})n$ as
required. ■



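Justifying the two-step distribution. Let $\mathcal{P}_{n,p_1,p_2}$ be the distribution of the two-step process.
For brevity, set $\mathcal{P}_1 = \mathcal{P}_{n,p}$, $\mathcal{P}_2 = \mathcal{P}_{n,p_1,p_2}$. Let us now prove that $\mathcal{P}_1$ and $\mathcal{P}_2$ are identical for
$p = p_1 + (1 - p_1)p_2$. Let $\sigma$ be an ordered list of $|\sigma| = r$ clauses. Then

$$\Pr_{\mathcal{P}_1}[\text{get list } \sigma] = \frac{p^r (1-p)^{M-r}}{r!}.$$

On the other hand,

$$\begin{aligned}
\Pr_{\mathcal{P}_2}[\text{get list } \sigma]
&= \sum_{i=0}^{r} \frac{p_1^i (1-p_1)^{M-i}}{i!} \cdot \frac{p_2^{r-i} (1-p_2)^{M-r}}{(r-i)!} \\
&= \frac{(1-p_1)^M p_2^r (1-p_2)^{M-r}}{r!} \sum_{i=0}^{r} \binom{r}{i} \left(\frac{p_1}{(1-p_1)p_2}\right)^i \\
&= \frac{(1-p_1)^M p_2^r (1-p_2)^{M-r}}{r!} \left(1 + \frac{p_1}{(1-p_1)p_2}\right)^r \\
&= \frac{\left((1-p_1)(1-p_2)\right)^{M-r} (p_1 + p_2 - p_1p_2)^r}{r!}.
\end{aligned}$$

Choosing $p_1, p_2$ to satisfy $p_1 + p_2 - p_1p_2 = p$ (so that $(1-p_1)(1-p_2) = 1 - p$), we conclude that the
distributions $\mathcal{P}_1$ and $\mathcal{P}_2$ are indeed identical.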
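The identity can also be checked numerically for small parameters. The sketch below assumes the reading of the two-step probability as a sum over the number i of clauses of the list that come from the first round; the function names and the specific values of n, r, p_1, p_2 are illustrative.

```python
from math import comb, factorial, isclose

def prob_one_step(p, M, r):
    """Probability of a fixed ordered list of r clauses in the one-step model."""
    return p**r * (1 - p)**(M - r) / factorial(r)

def prob_two_step(p1, p2, M, r):
    """Same probability in the two-step model, summing over the split point i."""
    total = 0.0
    for i in range(r + 1):
        first = p1**i * (1 - p1)**(M - i) / factorial(i)          # round one picks the first i clauses
        second = p2**(r - i) * (1 - p2)**(M - r) / factorial(r - i)  # round two picks the rest
        total += first * second
    return total

M = 8 * comb(4, 3)             # all 3-clauses over n = 4 variables
p1, p2 = 0.1, 0.25
p = p1 + (1 - p1) * p2
one = prob_one_step(p, M, 5)
two = prob_two_step(p1, p2, M, 5)
```

The two values agree up to floating-point error, as the derivation predicts.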
2.7 Proof of Theorem [1]

Theorem [1] follows from Proposition [6], which implies that all but $e^{-\Theta(n^2p)}n$ of the variables are
frozen. Therefore, there are at most $\exp\{e^{-\Theta(n^2p)}n\}$ possible ways to set the assignment of the
remaining variables. Furthermore, Proposition [6] describes the formula induced by the non-frozen
variables.

3 Proof of Theorem [2]



SAT(F, t)

Step 1: Majority Vote
  1. $\pi_1 \leftarrow$ Majority Vote over F.
Step 2: Reassignment
  2. for i = 1 to log n
  3.   for all $x \in V$
  4.     if x supports less than 2t/3 clauses w.r.t. $\pi_i$ then $\pi_{i+1} \leftarrow \pi_i$ with x flipped.
  5.   end for.
  6. end for.
Step 3: Unassignment
  7. set $\psi_1 = \pi_{\log n}$, i = 1.
  8. while $\exists x$ s.t. x supports less than t clauses w.r.t. $\psi_i$
  9.   set $\psi_{i+1} \leftarrow \psi_i$ with x unassigned.
  10.  $i \leftarrow i + 1$.
  11. end while.
Step 4: Unit Clause Propagation
  12. Let $\xi$ be the final partial assignment obtained at Step 3.
  13. Remove all clauses which are satisfied by $\xi$, and all FALSE-literals from the remaining clauses.
  14. Run the unit-clause-propagation algorithm on the resulting instance.
Step 5: Exhaustive Search
  15. Let F' be the formula remaining after the unit-clause-propagation of Step 4 terminates.
  16. Exhaustively search and satisfy $F'_{out}[A, \xi]$, component by component.



Fig. 1. The algorithm SAT 
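Steps 2 and 3 of SAT can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: clauses are tuples of nonzero integers, an assignment maps each variable to True, False or None (unassigned), a variable supports a clause when its literal is the only TRUE literal and the remaining literals are assigned FALSE, and all flips within one reassignment round are applied simultaneously. The helper names are ours.

```python
def support_counts(clauses, assignment):
    """counts[x] = number of clauses supported by x under the (partial) assignment."""
    counts = {x: 0 for x in assignment}
    for clause in clauses:
        values = [None if assignment.get(abs(l)) is None
                  else assignment[abs(l)] == (l > 0) for l in clause]
        # Supported clause: every literal assigned, exactly one of them TRUE.
        if None not in values and values.count(True) == 1:
            counts[abs(clause[values.index(True)])] += 1
    return counts

def reassign(clauses, assignment, t, rounds):
    """Step 2: flip every variable supporting fewer than 2t/3 clauses, round by round."""
    pi = dict(assignment)
    for _ in range(rounds):
        counts = support_counts(clauses, pi)
        for x in [x for x in pi if counts[x] < 2 * t / 3]:
            pi[x] = not pi[x]
    return pi

def unassign(clauses, assignment, t):
    """Step 3: repeatedly unassign a variable supporting fewer than t clauses."""
    psi = dict(assignment)
    while True:
        counts = support_counts(clauses, psi)
        weak = [x for x, v in psi.items() if v is not None and counts[x] < t]
        if not weak:
            return psi
        psi[weak[0]] = None

clauses = [(1, -2, -3), (2, -1, -3), (3, -1, -2)]
all_true = {1: True, 2: True, 3: True}
```

In the toy formula each variable supports exactly one clause under the all-TRUE assignment, so nothing is unassigned for t = 1, while for t = 2 the unassignment cascades through all three variables.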



In this section we prove that the algorithm SAT, which is described in Figure [1], meets the
requirements of Theorem [2]. The main principles underlying SAT were designed with the planted
distribution in mind (see [14] for example). An additional ingredient that we add is a
unit-clause-propagation step. Given a 1-2-3-CNF formula (namely a formula which contains clauses of size 1, 2
and 3), the unit-clause-propagation is the following simple heuristic:

while there exists a clause of size 1, set the variable appearing in this clause in a satisfying 
manner, remove this clause and all other clauses satisfied by this assignment, and remove 
the FALSE literals of the variable from other clauses. 
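The heuristic above can be sketched as a short loop. The sketch assumes the same integer-literal encoding of clauses as a convenience (not taken from the paper); it returns the forced assignments together with the residual formula, and an empty clause in the residual formula would signal a conflict.

```python
def unit_propagate(clauses):
    """Repeatedly satisfy unit clauses; return (assignment, remaining clauses)."""
    clauses = [frozenset(c) for c in clauses]
    assignment = {}
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            return assignment, clauses
        (lit,) = unit
        assignment[abs(lit)] = lit > 0           # set the variable in a satisfying manner
        remaining = []
        for c in clauses:
            if lit in c:
                continue                          # clause satisfied, remove it
            remaining.append(c - {-lit})          # drop the now-FALSE literal
        clauses = remaining

assignment, rest = unit_propagate([(1,), (-1, 2), (-2, 3, 4), (1, 5, 6)])
```

Here the unit clause (1) forces x_1 = TRUE, which shrinks (-1, 2) to the unit clause (2) and in turn leaves (3, 4) as the only remaining clause.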

We say that F* is typical in $\mathcal{P}^{sat}_{n,p}$ if Propositions [6] and [7] hold. The discussion in Section [2]
guarantees that indeed whp F* is typical. Therefore, to prove Theorem [2] it suffices to consider a
typical F* and prove that SAT (always) finds a satisfying assignment for F*. As the parameter t
for SAT we use the t promised in Proposition [6].
We let $\mathcal{H}$ be the t-core promised in Proposition [6], S its satellite variables, and $\varphi$ be the satisfying
assignment w.r.t. which $\mathcal{H}$ is defined. In all the following propositions we assume F* is typical (we
don't explicitly state it every time for the sake of brevity).

Proposition 8. Let $\psi_1$ be the assignment defined in line 7 of SAT. Then $\psi_1$ agrees with $\varphi$ on the
assignment of all variables in $\mathcal{H}$.

Proof. Let $B_i$ be the set of core variables whose assignment in $\pi_i$ disagrees with $\varphi$ at the beginning
of the i-th iteration of the main for-loop (line 2 in SAT). It suffices to prove that $|B_{i+1}| \le |B_i|/2$
(if this is true, then after $\log n$ iterations $B_{\log n} = \emptyset$). Observe that by Proposition [7], $|B_0| <$ n/10'^
(as the Majority Vote error-rate $e^{-\Theta(n^2p)}$ can be made arbitrarily small). By contradiction, assume
that not in every iteration $|B_{i+1}| \le |B_i|/2$, and let j be the first iteration violating this inequality.
Consider a variable $x \in B_{j+1}$. If also $x \in B_j$, this means that x's assignment was not flipped in the
j-th iteration, and therefore x supports at least 2t/3 clauses w.r.t. $\pi_j$. Since $\mathcal{H}$ is t/3-self-contained,
at least $2t/3 - t/3 = t/3$ of these clauses contain only core variables. Since the literal of x is true in all
these clauses, but in fact should be false under $\varphi$, each such clause must contain another variable on
which $\varphi$ and $\pi_j$ disagree, that is, another variable from $B_j$. If $x \notin B_j$, this means that x's assignment
was flipped in the j-th iteration. This is because x supports less than 2t/3 clauses w.r.t. $\pi_j$. Since x
supports at least t clauses w.r.t. $\varphi$ (the t-expanding property of the core), it must be that in at least
$t - 2t/3 = t/3$ of them, the literal of some other core variable evaluates to TRUE (not FALSE as
it should be in $\varphi$). Letting $U = B_j \cup B_{j+1}$, there are at least $t/3 \cdot |B_{j+1}|$ clauses containing at least
two variables from U (every clause is counted exactly once as the supporter of a clause is unique).
Using our assumption, $|B_{j+1}| > |B_j|/2$, we obtain $|U| = |B_j \cup B_{j+1}| \le |B_j| + |B_{j+1}| \le 3|B_{j+1}|$,
therefore $t/3 \cdot |B_{j+1}| \ge (|U|/3) \cdot t/3 = (t/9)|U|$. Finally,

— $|B_j| <$ n/10^ (because $B_0$ is already small enough, and by our assumption the sets $B_1, B_2, \ldots, B_j$
only decrease in size),

— $|B_{j+1}|$ may exceed n/lO'^, in which case we consider w.l.o.g. only the first n/10^ variables (this
is in line with our assumption $|B_{j+1}| > |B_j|/2$),

— $|U| \le 3|B_{j+1}| <$ 3n/10^ < n/10^,

— there are $t|U|/9$ clauses containing two variables from U.

The last two items contradict the t/10-proportionality of F* . 



Proposition 9. Let $\xi$ be the partial assignment defined in line 12 of SAT. Then all assigned variables
in $\xi$ are assigned according to $\varphi$, and all the variables in $\mathcal{H}$ are assigned.

Proof. By Proposition [8], $\psi_1$ coincides with $\varphi$ (the satisfying assignment w.r.t. which $\mathcal{H}$ is defined)
on $\mathcal{H}$. Furthermore, by the definition of a t-core, every core variable supports at least t clauses
w.r.t. $\varphi$, and also w.r.t. $\psi_1$ (the assignment at hand before the unassignment step begins). Hence
all core variables survive the first round of unassignment. By induction it follows that the core
variables survive all rounds. Now suppose by contradiction that not all assigned variables are
assigned according to $\varphi$ when the unassignment step ends. Let U be the set of variables that
remain assigned when the unassignment step ends, and whose assignment disagrees with $\varphi$. Every
$x \in U$ supports at least t clauses w.r.t. $\xi$ (the partial assignment defined in line 12 of SAT), but
each such clause must contain another variable on which $\xi$ and $\varphi$ disagree (since $\varphi$ satisfies this
clause). Thus, we have $t \cdot |U|$ clauses each containing at least two variables from U (again no clause
is counted twice as the supporter of a clause is unique). Since $U \cap \mathcal{H} = \emptyset$ (by the first part of this
argument) and $|\mathcal{H}| \ge (1 - e^{-\Theta(n^2p)})n$ it follows that $|U| \le e^{-\Theta(n^2p)}n <$ n/10^, contradicting the
t/10-proportionality of F*. ■

Proposition 10. By the end of the unit-clause-propagation step all the variables which get assigned
are assigned according to $\varphi$; furthermore, the set of satellite variables S is assigned.

Proof. The proof is by induction on the iterations of the unit clause propagation. The base case
consists of clauses of the form $(x \vee \ell_z \vee \ell_y)$ where $\ell_z, \ell_y$ are FALSE literals under $\xi$, and x is unassigned.
By the previous proposition, $\xi$ can be extended to a satisfying assignment of F*, but every such
extension must set x = TRUE. This is exactly what the unit clause propagation does. The step of
the induction is proven similarly to the base case.

Now to the satellite variables. The previous proposition gives that $\mathcal{H}$ remains assigned according
to $\varphi$. By the definition of satellite variables, S will be set in the unit clause propagation (the
i-satellite variables will be set in iteration i of the unit-clause propagation). ■

Proposition 11. The exhaustive search, Step 5 of SAT, completes in polynomial time with a
satisfying assignment of F*.

Proof. By Proposition [10], the partial assignment at the beginning of the exhaustive search step
is consistent with the satisfying assignment $\varphi$ of the entire formula. Therefore the exhaustive search will
succeed. Further observe that the unassigned variables are outside of $\mathcal{H} \cup S$. Proposition [6] then
guarantees that the running time of the exhaustive search will be at most polynomial. ■

Theorem [2] follows.

4 Proof of Proposition [4] 

Let F be the random $\mathcal{P}_{n,p}$ instance, F* be its satisfiable part. We divide the process of generating
F into two steps like in the proof of Proposition [7]: in the first round go over the $M = 8\binom{n}{3}$ clauses
and toss a coin with success probability $p_1 = p/2$. Take the clauses that were chosen and put them
first, ordered at random. In the second round, every clause that was not chosen is included with
probability $p_2$, where $p_2$ satisfies $p_1 + (1 - p_1)p_2 = p$; then the included clauses are ordered at random and
concatenated after the first part. Observe that this distribution is identical to $\mathcal{P}_{n,p}$ as explained
before.

Let t be such that F (and hence also $F_1$) is whp t-proportional (we can choose $t = n^2p/5500$
as asserted in Proposition [1]). Also take $n^2p$ sufficiently large so that F* is whp n/lO^-concentrated
(as required by Proposition [3] and as promised to be the case whp by Proposition [2]).

Fix Ip to be some assignment (not necessarily a satisfying assignment of F*), and let B^ be 
a random variable counting the number of variables whose support in Fi w.r.t. ip is smaller than 
502t. A bound of the sort Pr[B^ > n/lO"^] = 0(2"") would be very useful as we can then take 
the union bound over all possible assignments tjj. Fix some variable x, and w.l.o.g. assume x is 
TRUE in ip. There are ("2 ) clauses that x supports w.r.t. ip, each included w.p. pi. Therefore 
in expectation x supports at least n^pi/3 = n?p/Q clauses. Since the support of x is distributed 
binomially, the probability that x supports less than t clauses in Fi w.r.t. ip is at most e~" ^'^^ 
(say, use the Chernoff bound). Finally observe that the set of clauses that x supports is disjoint 
from the set of clauses that $y \ne x$ supports. Therefore, the probability that there are at least n/10'^
such variables is at most (^/iqt)^" p/50)-(n/io ) ^ g-n ^^^ sufficiently large v?p. 

In particular, whp every $\psi$ that satisfies $F_1^*$ has the desired property. Let now $\psi$ be a satisfying
assignment of $F_1^*$ such that $B_\psi <$ n/lO"^, and consider the following procedure which, as we shall
prove, produces a large 500t-expanding set Z in $F_1^*$ (and therefore also in F* which contains $F_1^*$).
When using the notation F[A] for a formula F and a set of variables A we mean all clauses in F
in which all three variables belong to A.



1. set $Z_0 = V \setminus \{x \in V : x$ supports less than 502t clauses in $F_1$ w.r.t. $\psi\}$; i = 0.

2. while there exists a variable $a_i \in Z_i$ that supports less than 500t clauses in $F_1[Z_i]$ do $Z_{i+1} = Z_i \setminus \{a_i\}$; $i \leftarrow i + 1$.

3. let $a_r$ be the last variable removed in step 2. Define $Z = Z_{r+1}$.



Fig. 2. Building a t-expanding set 
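The peeling procedure of Figure 2 can be sketched as follows. This is an illustrative sketch: the two thresholds (502t and 500t in the figure) are passed as generic parameters, a variable is taken to support a clause when its literal is the only TRUE literal under the assignment, and the function names are ours.

```python
def support_in(clauses, psi, allowed):
    """Support counts restricted to clauses whose variables all lie in `allowed`."""
    counts = {x: 0 for x in allowed}
    for clause in clauses:
        if not all(abs(l) in allowed for l in clause):
            continue
        true_vars = [abs(l) for l in clause if psi[abs(l)] == (l > 0)]
        if len(true_vars) == 1:
            counts[true_vars[0]] += 1
    return counts

def expanding_set(clauses, psi, first_threshold, second_threshold):
    """Step 1 drops weakly supported variables; step 2 peels until the set is stable."""
    everything = set(psi)
    total = support_in(clauses, psi, everything)
    z = {x for x in everything if total[x] >= first_threshold}
    while True:
        inside = support_in(clauses, psi, z)
        weak = sorted(x for x in z if inside[x] < second_threshold)
        if not weak:
            return z
        z.remove(weak[0])            # remove one variable a_i per iteration

cyclic = [(1, -2, -3), (2, -1, -3), (3, -1, -2)]
stable = expanding_set(cyclic, {1: True, 2: True, 3: True}, 1, 1)
chain = [(1, -2, -3), (2, -1, -3), (3, -1, -4)]
peeled = expanding_set(chain, {1: True, 2: True, 3: True, 4: True}, 1, 1)
```

The second example shows the chain reaction the proof has to control: dropping variable 4 in step 1 removes the only clause supported by 3, whose removal in turn strips 1 and 2 of their support.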



Clearly, Z is 500t-expanding in Fi (by the construction). It remains to prove that Z is large. By 
our assumption on B^ step 1 removes at most n/10^ variables, let A be those variables. It remains 
to prove that in the iterative step not too many variables were removed. Suppose by contradiction 
that in the iterative step more than n/10^ variables were removed, and consider iteration j = n/Kf 
and the set $W = \{a_1, \ldots, a_j\}$ ($a_i \in W$ is defined in line 2 of Figure [2]). Every $a_i \in W$ appears
in more than $502t - 500t = 2t$ clauses in which at least another variable belongs to $U = W \cup A$
(by the choice of $Z_0$ and the condition in line 2 that caused $a_i$ to be removed). Therefore, by
iteration n/10^, the set U contains at most n/lO'^ + n/lO'^ < n/10^ variables, and there are more 
than $2t \cdot |W| \ge 2t \cdot |U|/2 = t|U|$ clauses containing at least two variables from U (no clause is
counted twice as the supporter of a clause is unique). This contradicts the t-proportionality of $F_1$.
To conclude, \Z\ > (l — 10~^) n > 0.99n as required. Observe that \W\ > \U\/2 by our assumption 
on the size of A and by the choice of j = n/10^. 

It follows that whp for every satisfying assignment $\psi$ of $F_1^*$ there exists a 500t-expanding set Z
of variables of cardinality $|Z| \ge 0.99n$. W.l.o.g. we can take Z to be a maximal such set.

Observe that Z and $F_1^*$ satisfy the conditions of Proposition [3] (that is, $\psi$ is a satisfying
assignment, $F_1^*$ is t-proportional and n/lO^-concentrated) and therefore Z is frozen in $F_1^*$; w.l.o.g. assume
that all variables in Z froze to TRUE. Since all variables of Z are frozen in $F_1^*$, we can take the
same Z for every satisfying assignment $\psi$ of $F_1^*$.

So let Z be as above, $|Z| \ge 0.99n$. Now we consider the second round of coin tosses, and call the
chosen clauses $F_2$. We prove that after adding them, with probability $1 - o(2^{-n})$ Z extends to a
t-expanding set Z', $Z \subseteq Z'$, of the required size ($|Z'| \ge (1 - e^{-\Theta(n^2p)})n$). Fix some variable $x \notin Z$ and
observe that x supports $\binom{|Z|}{2}$ clauses, where x appears without negation and the other two variables
are in Z and appear as negated. Since $x \notin Z$, we know that in the first iteration at most 500t such
clauses were included. In expectation, $F_2$ contains at least $p_2\left(\binom{|Z|}{2} - 500t\right) \ge n^2p/5 \ge 1000t$ such
clauses (this is due to $p_2 \ge p/2$). If indeed at least 500t clauses are included then $Z \cup \{x\}$ is a
500t-expanding set. The probability that less than 500t of them were included is $e^{-\Theta(n^2p)}$ (again,
Chernoff bound). We can argue similarly about the number of clauses in $F_2$, containing x and two
variables from Z, where all three variables appear as negated.

Call a variable x good if it participates in at least 500t clauses in $F_2$ where the other two variables
are from Z and are negated and x is not negated, and also in at least 500t clauses in $F_2$ where the
other two variables are from Z and all three variables are negated; otherwise x is called bad. Observe
that for every good x, for every satisfying assignment $\psi$ of $F_1^* \cup F_2$, we can add x to Z, regardless
of whether $\psi$ sets x to TRUE or FALSE. The above argument shows that the expected number
of bad variables is $e^{-\Theta(n^2p)}n$. Applying standard concentration techniques, we can derive that whp
the number of bad variables is $e^{-\Theta(n^2p)}n$ as well.

And so we have proven that whp there exists a 500t-expanding set Z' (which contains Z) and
$|Z'| = |Z| + (1 - e^{-\Theta(n^2p)})|V \setminus Z| \ge (1 - e^{-\Theta(n^2p)})|V|$.

In conclusion, we have shown that there exists a 500t-expanding set Z' in F* of cardinality
$|Z'| = (1 - e^{-\Theta(n^2p)})n$ w.r.t. $\psi$, where $\psi$ is some satisfying assignment of F* (in fact this is true
w.r.t. all satisfying assignments of F* by the frozenness property). Scaling everything down (setting
t' = 500t), Z' is t'-expanding and the formula is (at least) t'/500-proportional. This completes the proof of the
proposition.

Remark 7. Note that here we proved t'/500-proportionality, which is stronger than what we are
required to prove (t'/10-proportionality). In general, we prefer a clear and shorter presentation over
optimizing the constants in the proofs. Later we will use this slackness in other proofs that rely on
this one.

4.1 Proof of Proposition [5]

Let Z be the t-expanding set promised by Proposition [4]. Consider the procedure in Figure [3], which
shall produce a t'-core (for t' = 10t/11). Recall that using the notation F[A] for a formula F and a set
of variables A we mean all clauses in F in which all three variables belong to A.



1. set $H_0 = Z$ and i = 0.

2. while there exists a variable $a_i \in H_i$ that:

   — $a_i$ appears in more than t/11 clauses in $F \setminus F[H_i]$, or,

   — $a_i$ supports less than 10t/11 clauses in $F[H_i]$,

   do $H_{i+1} = H_i \setminus \{a_i\}$.

3. let $a_r$ be the last variable removed in step 2. Define $\mathcal{H} = H_{r+1}$.



Fig. 3. Building a t-core 
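The procedure of Figure 3 differs from that of Figure 2 in that a variable is also removed when it appears in too many clauses leaving the current set. The following sketch makes both removal conditions explicit; the thresholds (10t/11 and t/11 in the figure) are generic parameters here, the support notion is the same assumed one (sole TRUE literal), and the function name is ours.

```python
def build_core(clauses, psi, start, min_support, max_outside):
    """Peel `start` until every variable is well supported inside and rarely appears outside."""
    h = set(start)
    while True:
        support = {x: 0 for x in h}      # clauses of F[H_i] supported by x
        outside = {x: 0 for x in h}      # appearances of x in F \ F[H_i]
        for clause in clauses:
            variables = [abs(l) for l in clause]
            if all(v in h for v in variables):
                true_vars = [abs(l) for l in clause if psi[abs(l)] == (l > 0)]
                if len(true_vars) == 1:
                    support[true_vars[0]] += 1
            else:
                for v in set(variables) & h:
                    outside[v] += 1
        bad = sorted(a for a in h
                     if outside[a] > max_outside or support[a] < min_support)
        if not bad:
            return h
        h.remove(bad[0])

psi = {1: True, 2: True, 3: True, 4: True}
cyclic = [(1, -2, -3), (2, -1, -3), (3, -1, -2)]
core = build_core(cyclic, psi, {1, 2, 3}, 1, 0)
leaky = build_core(cyclic + [(4, -1, -2)], psi, {1, 2, 3}, 1, 0)
```

In the second call a single clause reaching outside the candidate set violates self-containment for variables 1 and 2, and their removal empties the set.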



First let us explain why indeed $\mathcal{H}$ is a t'-core. By its construction $\mathcal{H}$ is 10t/11-expanding (or
t'-expanding). Further, $\mathcal{H}$ is t/11-self-contained, or t'/10-self-contained (which also implies
t'/3-self-contained as 1/3 > 1/10).

Remark 8. By the definition of a core we are required to prove only t'/3-self-contained, but we shall
need this slackness in the proof of Proposition [6].

It remains to prove that $|\mathcal{H}| \ge (1 - e^{-\Theta(n^2p)})n$. By Proposition [4], $|H_0| \ge (1 - e^{-c_1n^2p})n$ for some
constant $c_1 > 0$ independent of n, p. Let $A = V \setminus H_0$, and note that $|A| \le e^{-c_1n^2p}n$. Suppose
that the iterative procedure (line 2) removed more than $e^{-c_1n^2p}n$ variables. Consider iteration
$j = e^{-c_1n^2p}n$ and the set $W = \{a_1, \ldots, a_j\}$ ($a_i \in W$ is defined in line 2 of Figure [3]). Define
$U = W \cup A$. One possibility for the removal of $a_i$ is that it appears in at least t/11 clauses in which
at least another variable belongs to U. Another is that $a_i$ supports less than 10t/11 clauses w.r.t.
$F[H_i]$. In the latter case $a_i$ must support at least $t - 10t/11 = t/11$ clauses in $F \setminus F[H_i]$ (by the
choice of $a_i \in Z$). In any case $a_i$ appears in at least t/11 clauses with at least another variable
from U. Therefore, by iteration j, there exists a set U containing at most $2e^{-c_1n^2p}n \ll$ n/10^
variables, and there are at least $(t/33) \cdot |W| \ge (t/33) \cdot (|U|/2) = t|U|/66$ clauses containing at least
two variables from it (we divide t/11 by 3 as a clause could have been counted three times). This
however contradicts the t/500-proportionality of F (recall that in Proposition [4], when proving the
existence of a t-expanding set Z, we actually proved that F is t/500-proportional; see Remark [7]).

Finally, the core variables are frozen as they are a subset of Z, and Z, the t-expanding set,
is frozen.

5 Proof of Proposition [6]

Let F be the random $\mathcal{P}_{n,p}$ instance, F* its satisfiable part. We divide the process of generating F
into two steps like in the proof of Proposition [7]: in the first round go over the $M = 8\binom{n}{3}$ clauses
and toss a coin with success probability $p_1 = p/2$. Take the clauses that were chosen and put them
first. In the second round, every clause that was not chosen is included with probability $p_2$, where
$p_2$ satisfies $p_1 + (1 - p_1)p_2 = p$. Let $F_1^*$ be the part of F* that corresponds to the first round ($F_1^*$
is distributed according to $\mathcal{P}^{sat}_{n,p_1}$). Let $F_2$ be the clauses that were chosen in the second round.

By Proposition [5], if we take $n^2p$ to be sufficiently large, then $F_1^*$ has whp a t-core $\mathcal{H}$ w.r.t.
a satisfying assignment $\psi$ with the following properties (the last property did not appear in
Proposition [5]; we define and justify it immediately after):

— $F_1$ is t/500-proportional (Remark [7]).

— $\mathcal{H}$ is t/10-self-contained and not only t/3-self-contained (Remark [8]).

— $|\mathcal{H}| \ge (1 - e^{-\Theta(n^2p)})n$.

— $\mathcal{H}$ is frozen.

— $F_1^*$ is bounded.

We say that a formula F is bounded if no variable appears in more than n clauses. In Fi every 
variable is expected to appear in 0{n?p) clauses, and we may assume that vP'p = 0{v}''^) (if not, 
then in particular whp $\mathcal{H} = V$ and the entire discussion in this section is unnecessary). Standard
calculations then show that whp no variable appears in more than n clauses of $F_1$.

We now discuss what happens to $\mathcal{H}$ in the second round, that is when adding $F_2$. We will be
interested in large connected components of $F_1$ whose vertices are not in $\mathcal{H}$ (Proposition [14]), and
also in vertices that may leave $\mathcal{H}$ due to $F_2$ (Propositions [12] and [13]). The key to understanding
the transformation that $\mathcal{H}$ and the connected components undergo lies in the notion of satellite
variables.

First observe that $F_1$ is whp t/500-proportional, and therefore so is $F_2$ (as they are almost
identically distributed, and there is enough slackness in the choice of constants to accommodate
this difference). Hence whp $F = F_1 \cup F_2$ is t/250-proportional (and so is F*). Assume that this is
the case.

Proposition 12. If after the second round F* remains t/250-proportional, then there exists a
satisfying assignment $\psi$ of F* and a set $\mathcal{H}' \subseteq \mathcal{H}$ of variables which is a t/2-core of F* w.r.t. $\psi$.
Furthermore, $|\mathcal{H}'| \ge (1 - e^{-\Theta(n^2p)})n$.

Proof. We call a variable $x \in \mathcal{H}$ dirty if in $F_2$ there exists a clause C containing x and some
variable not in $\mathcal{H}$. Let D be the set of dirty variables. For a specific x, there are $e^{-\Theta(n^2p)}n^2$ clauses
such that if chosen to $F_2$ will make x dirty. The probability that any of them appears is at most
$p_2 \cdot e^{-\Theta(n^2p)}n^2 = e^{-\Theta(n^2p)}$ (since $e^{-\Theta(n^2p)}$ is much smaller than $n^2p_2$ for sufficiently large p). Linearity
of expectation gives $E[|D|] = e^{-\Theta(n^2p)}n$. Also observe that $|D|$ satisfies the Lipschitz condition
with difference 3 (as every new clause can affect 3 new variables). Therefore concentration is also
obtained. Let us assume from now on that indeed $|D| = e^{-\Theta(n^2p)}n$.

Consider $\mathcal{H}$ after scanning $F_2$ (to complete F*) and set $\mathcal{H}_0 = \mathcal{H} \setminus D$, i = 0. Very similarly to
the procedure in Figure [3], consider the following iterative procedure:

while there exists $x \in \mathcal{H}_i$ s.t. x supports less than t/2 clauses in $F[\mathcal{H}_i]$ w.r.t. $\psi$, or appears in
more than t/6 clauses where some variable belongs to $V \setminus \mathcal{H}_i$, define $\mathcal{H}_{i+1} = \mathcal{H}_i \setminus \{x\}$, i = i + 1.

Set $\ell = e^{-cn^2p}n$, where c is some constant satisfying $|D| \le e^{-cn^2p}n$. Suppose that the iterative
process reached iteration $\ell$, and let $W_\ell$ be the set of variables that were removed in iterations
$1, \ldots, \ell$, let $U = W_\ell \cup D$, and observe that $|W_\ell| \ge |U|/2$ by our choice of c. Take $x \in W_\ell$. If x was
removed in iteration i because it appeared in more than t/6 clauses where some variable belongs
to $V \setminus \mathcal{H}_i$, then since x was part of $\mathcal{H}$ to begin with, and $\mathcal{H}$ was t/10-self-contained, x must
appear in at least $t/6 - t/10 = t/15$ clauses in which at least another variable belongs to U. If
x was removed because it supports less than t/2 clauses in $F[\mathcal{H}_i]$, then again, x was part of $\mathcal{H}$,
and therefore it supports at least t clauses in $F[\mathcal{H}]$, and hence it must support (and, in particular,
appear in) at least $t - t/2 = t/2$ clauses in which some variable belongs to U. At any rate, every
$x \in W_\ell$ appears in at least t/15 clauses in which at least another variable belongs to U. Finally,

— there are $t|W_\ell|/15 \cdot 1/3 \ge t|U|/90$ clauses containing at least two variables from U (we divide
by 3 as every clause might have been over-counted up to 3 times, and we use the fact that
$|W_\ell| \ge |U|/2$),

— $|U| = |D| + |W_\ell| = e^{-\Theta(n^2p)}n <$ n/10^ (we used our estimate on $|D|$, and the fact that we look
at the iterative process until iteration $\ell$, therefore $|W_\ell| \le \ell = e^{-\Theta(n^2p)}n$).

Combining these two facts contradicts the t/250-proportionality of F*. Therefore if we let W denote
the set of variables that were removed in the iterative step, in all iterations, then whp $|W| \le \ell$. Now
set t' = t/2, and let $\mathcal{H}' = \mathcal{H} \setminus (D \cup W)$. We have shown that the set $\mathcal{H}'$ is a t'-core of the required
size. Further, F* is (at least) t'/10-proportional as required by Proposition [6].

Finally observe that $\mathcal{H}$ is frozen and hence $\mathcal{H}' \subseteq \mathcal{H}$ is frozen too. Therefore although $\mathcal{H}'$ is
defined w.r.t. $\psi$, it will be a core of F* regardless of which satisfying assignments survive at the
end (as it will be a core w.r.t. all of F*'s satisfying assignments, and at least one is guaranteed to
survive). ■

Proposition 13. Let S' be the satellite variables of $\mathcal{H}'$. If F* is t/10-proportional then $\mathcal{H} \setminus \mathcal{H}' \subseteq S'$.

Proof. Let $A = \mathcal{H} \setminus \mathcal{H}'$. Let S' be all the satellite variables of $\mathcal{H}'$, and by contradiction assume
that the set $B = A \setminus S'$ is non-empty. Every x in B belongs to $\mathcal{H}$ and therefore supports at least
t clauses where the other two variables appear in $\mathcal{H}$. Observe that in none of these t clauses the
other two variables are in $\mathcal{H}' \cup S'$ (as otherwise x is in S'). Therefore we have found a set B,
$|B| = e^{-\Theta(n^2p)}n <$ n/10^, for which there are at least $t|B|$ clauses containing two variables from B.
This contradicts the t/10-proportionality of F*. ■

In the proof of Proposition [6] we consider two "types" of satellite variables. The first type, which
we just met, are the variables in $\mathcal{H} \setminus \mathcal{H}'$. The second type, which we will make use of in the proof
of Proposition [14] ahead, are satellite variables of $\mathcal{H}$ whose "job" is to shatter the large connected
components in the formula induced by variables not in $\mathcal{H}$ (when exposing the second part of F).
In some sense these two types represent competing processes. One is variables leaving $\mathcal{H}$ but
still remaining satellite variables; the other is new variables attaching to $\mathcal{H}$ as satellite variables.

Recall our notation $F_{out}[A, \varphi]$ (A a set of variables, $\varphi$ an assignment) which stands for the
subformula of F which is the outcome of the following procedure: set the variables in A according
to $\varphi$ and simplify F (by simplify we mean remove every clause that contains a TRUE literal, and
remove FALSE literals from the other clauses). The connected components of a formula F are the
sub-formulas $F[C_1], \ldots, F[C_k]$, where $C_1, C_2, \ldots, C_k$ are the connected components in the graph
$G_F$ induced by F (the vertices of $G_F$ are the variables, and two variables are connected by an edge
if there exists some clause containing them both).
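Both notions recalled above are easy to make concrete. The sketch below, under the same assumed integer-literal encoding of clauses, simplifies a formula by a partial assignment and then extracts the connected components of the variable graph with a depth-first search; the function names are ours.

```python
def simplify(clauses, assigned):
    """Set the variables in `assigned` (dict var -> bool) and simplify the formula."""
    out = []
    for clause in clauses:
        if any(abs(l) in assigned and assigned[abs(l)] == (l > 0) for l in clause):
            continue                                   # contains a TRUE literal
        out.append(tuple(l for l in clause if abs(l) not in assigned))  # drop FALSE literals
    return out

def components(clauses):
    """Connected components of the graph whose edges join variables sharing a clause."""
    adjacency = {}
    for clause in clauses:
        variables = [abs(l) for l in clause]
        for v in variables:
            adjacency.setdefault(v, set()).update(variables)
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adjacency[v] - seen:
                seen.add(w)
                stack.append(w)
        result.append(component)
    return result

formula = [(1, 2, 3), (-1, 4, 5), (-3, 6, 7), (6, 8, 9)]
residual = simplify(formula, {1: True, 3: False})
parts = sorted(sorted(c) for c in components(residual))
```

Setting x_1 = TRUE and x_3 = FALSE satisfies the first and third clauses, and the residual formula splits into two components.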

Proposition 14. Let $\mathcal{H}$ be a t-core of $F_1^*$, let S be the set of all satellite variables of $\mathcal{H}$ in F*, and
let $\psi$ be a satisfying assignment of F*. Then the largest connected component in $F^*_{out}[\mathcal{H} \cup S, \psi]$ is
whp of size at most $\log n$.

First let us show why Proposition [13] completes the proof of Proposition [6]. Since the proposition
is true for F it is true, by monotonicity, for F*. We take $\mathcal{H}'$ for the core required by Proposition
[6], and denote by S' its satellite variables. Observe that (under the assumption of proportionality)
$\mathcal{H} \subseteq \mathcal{H}' \cup S'$, and hence by the definition of satellite variables $S \subseteq S'$. In particular $\mathcal{H} \cup S \subseteq \mathcal{H}' \cup S'$.

5.1 Proof of Proposition [14]

Let us refine the process of generating $F$: first we generate $F_1$ (and $F_1^*$), and fix TC according to
$F^*$. Then in the second round ($F_2$) first toss the coins of clauses $C$ s.t. at most one literal in $C$
belongs to TC, and call $J \subseteq F_2$ the set of clauses that were chosen. Finally toss the coins of the other
clauses (the ones that were not picked in the first step and contain at least two variables from TC),
and call $K \subseteq F_2$ the set of clauses that were chosen. In this new terminology $F = F_1 \cup J \cup K$, and we set
$F' = F_1 \cup J$. To prove Proposition [14] it suffices to consider only trees of size $\log n$ in $F'$. This is
because (a) every connected component of size at least $\log n$ contains a tree of size $\log n$, and (b)
only the clauses of $F'$ may contribute edges to the connected components of $F^*_{out}[TC \cup S, \psi]$.
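The split of the second-round clauses into candidates for $J$ and candidates for $K$ depends only on how many variables of a clause lie in TC. A minimal sketch, assuming signed-integer literals and a hypothetical helper name `split_second_round` (neither comes from the paper):

```python
def split_second_round(clauses, core):
    """Partition clauses by the number of their variables that lie in the core.

    Clauses with at most one core variable are candidates for J; clauses with
    at least two core variables are candidates for K. `core` is a set of
    variable indices; literals are signed ints (an assumption of this sketch).
    """
    j_candidates, k_candidates = [], []
    for clause in clauses:
        hits = len({abs(lit) for lit in clause} & core)
        if hits >= 2:
            k_candidates.append(clause)
        else:
            j_candidates.append(clause)
    return j_candidates, k_candidates
```

Tossing the coins of the first list before those of the second reproduces the order of exposure described above.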

We will prove Proposition [14] as follows: fix an arbitrary tree $T$ on $r$ vertices, and let $V(T)$ denote
its set of vertices. The following two conditions are necessary for $T$ to belong to $F^*_{out}[TC \cup S, \psi]$:

— $A$ = {there exists a subformula of $F'$ that induces $T$},

— $B$ = {the clauses in $K$ do not prevent the following from holding: $V(T) \cap S = \emptyset$}.

The probability that $F^*_{out}[TC \cup S, \psi]$ contains a tree of size at least $r$ is at most

$$\sum_{T:|V(T)|=r} \Pr[A \wedge B] = \sum_{T:|V(T)|=r} \Pr[A] \cdot \Pr[B|A] \le \left(\max_{T:|V(T)|=r} \Pr[B|A]\right) \cdot \left(\sum_{T:|V(T)|=r} \Pr[A]\right) = q \cdot h.$$



Our next goal is to bound $q$ and $h$, and then to show that $q \cdot h$ is polynomially small in $n$
for $r = \log n$. The next two lemmas establish the desired bounds (we use $d = n^2 p$).

Lemma 3. $h = \sum_{T:|V(T)|=r} \Pr[A] \le n(100d)^r$.

Lemma 4. $q = \max_{T:|V(T)|=r} \Pr[B|A] \le e^{-dr/8}$.
To conclude, for $r = \log n$,

$$q \cdot h \le e^{-dr/8} \cdot n(100d)^r = n^{1 + \log(100d) - d/8},$$

which is polynomially small in $n$ since $d/8 \gg 1 + \log(100d)$ for sufficiently large $d = n^2 p$. We shall now
prove the two lemmas.
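How large $d$ must be before the exponent $1 + \log(100d) - d/8$ becomes negative can be checked numerically. The following is a minimal sketch, assuming natural logarithms and the bounds of Lemmas 3 and 4; the function names are illustrative and not from the paper. Negativity is only the weakest requirement, and the argument asks for $d/8$ to exceed $1 + \log(100d)$ by a wide margin.

```python
import math

def union_bound_exponent(d):
    """Exponent e with n * (100 d)^{ln n} * exp(-d ln n / 8) = n^e."""
    return 1 + math.log(100 * d) - d / 8

def smallest_integer_density(limit=10_000):
    """Smallest integer d below `limit` for which the exponent is negative."""
    return next(d for d in range(1, limit) if union_bound_exponent(d) < 0)

if __name__ == "__main__":
    d0 = smallest_integer_density()
    print(d0, union_bound_exponent(d0), union_bound_exponent(200))
```

Since the exponent decreases in $d$ once $d > 8$, every integer density at or above the value returned by `smallest_integer_density` also gives a negative exponent.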

Proof of Lemma [3]. The quantity $h$ to be estimated is obviously the expected number of trees of
size $r = \log n$ induced by a formula $F' \subseteq F$ and is therefore at most the expected number of such
trees induced by $F$ itself. We thus estimate the latter quantity from above.

Let $T$ be a fixed tree on $r$ variables (a tree in the regular graph sense), and let $F_T$ be a fixed
collection of clauses such that each edge of $T$ is induced by some clause of $F_T$; we call such
$F_T$ an inducing set of clauses. We say that a clause set $F_T$ is minimal w.r.t. $T$ if, after deleting any
clause from $F_T$, $T$ is no longer induced by the new formula. By the definition of minimality,
$|F_T| \le |E(T)| = |V(T)| - 1$ (as $T$ is a tree). In our argument we shall be interested only in pairs $(T, F_T)$
s.t. $F_T$ is a minimal set of clauses that induces $T$.

Given a tree $T$ of size $r$, we estimate the number of ways to extend $T$ to a minimal inducing set
$F_T$. Every clause in $F_T$ can cover either one or two edges of $T$ (it cannot cover three edges or we
have a cycle in $T$). Following the argument in [11], let $N_{T,s}$ be the number of ways to pair $2s$ edges
of $T$ to form $s$ clauses in $F_T$ that cover two edges. There are 8 ways to set the polarity of variables
in every clause of $F_T$ (and there are $r - 1 - s$ such clauses), and at most $n^{r-1-2s}$ ways to choose
the third variable in the $r - 1 - 2s$ clauses that cover exactly one edge. Using this terminology,
the expected number of $r$-trees induced by a random formula $F$, generated according to $\mathcal{P}_{n,p}$ with
$n^2 p = d$, is at most:

$$\sum_{r\text{-trees } T} \sum_{s=0}^{r/2} N_{T,s}\, 8^{r-1-s}\, n^{r-1-2s} \left(\frac{d}{n^2}\right)^{r-1-s} \le \sum_{r\text{-trees } T} \left(\sum_{s=0}^{r/2} N_{T,s}\right) (8d)^r n^{1-r}. \qquad (2)$$
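The inequality in (2) is only bookkeeping of exponents. A short check, using $p = d/n^2$ and assuming $8d \ge 1$ (which holds for large $d$):

```latex
\[
8^{\,r-1-s}\; n^{\,r-1-2s} \left(\frac{d}{n^{2}}\right)^{r-1-s}
  = (8d)^{\,r-1-s}\; n^{\,(r-1-2s) - 2(r-1-s)}
  = (8d)^{\,r-1-s}\; n^{\,1-r}
  \le (8d)^{\,r}\; n^{\,1-r}.
\]
```

The power of $n$ no longer depends on $s$, which is why the sum over $s$ can be pulled out as the factor $\sum_{s} N_{T,s}$.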

Our next task is to obtain useful upper bounds on the sum $\sum_{s=0}^{r/2} N_{T,s}$. To this end let us fix a
degree sequence $(d_1, \ldots, d_r)$ for $T$, and consider the following procedure for properly pairing edges.
By proper we mean that every pair of edges can be covered by a 3CNF clause; for example, we
cannot pair the edges $(x_1, x_2)$ and $(x_3, x_4)$ as they result in a 4CNF clause. For each vertex, we
specify a permutation of the edges incident to that vertex. Then we iterate through the vertices,
and for each vertex, we iterate through the edges and pair up each unpaired edge with the edge
given by the permutation associated with the current vertex (and leave the edge unpaired if the
permutation sends the edge to itself). Any pairing of edges which can be covered by clauses can be
generated this way by choosing the permutations to transpose each pair of edges to be covered by
a single clause and to leave fixed all the other edges. Since there are $d_i!$ different permutations for
vertex $i$, we have

$$\sum_{s=0}^{r/2} N_{T,s} \le \prod_{i=1}^{r} d_i!.$$
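The bound $\sum_s N_{T,s} \le \prod_i d_i!$ can be sanity-checked by brute force on small trees. The sketch below counts proper pairings directly, as sets of disjoint pairs of edges in which each pair shares a vertex (the empty pairing included); the function names are hypothetical and the enumeration is exponential, so it is meant for tiny examples only.

```python
from math import factorial

def count_proper_pairings(edges):
    """Number of ways to pick disjoint pairs of edges, each pair sharing a vertex.

    In a tree, two edges with a common endpoint span exactly three variables,
    so each such pair can be covered by one 3CNF clause.
    """
    edges = [frozenset(e) for e in edges]

    def rec(remaining):
        if not remaining:
            return 1
        first, rest = remaining[0], remaining[1:]
        total = rec(rest)                      # leave `first` unpaired
        for i, other in enumerate(rest):
            if first & other:                  # common endpoint: a proper pair
                total += rec(rest[:i] + rest[i + 1:])
        return total

    return rec(edges)

def degree_factorial_product(edges):
    """Product of d_i! over the vertices of the graph given by `edges`."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    product = 1
    for d in degree.values():
        product *= factorial(d)
    return product
```

For the path on four vertices the count is 3 against a bound of 4, and for the star with three leaves it is 4 against a bound of 6.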

A classical result by Prüfer is that the number of $r$-trees with degree sequence $(d_1, \ldots, d_r)$ equals
$\binom{r-2}{d_1 - 1, \ldots, d_r - 1}$ (see, for example, [20], Section 4.1, p. 33). There are $\binom{n}{r}$ ways to choose the $r$ vertices
of the tree. So (2) is at most

$$\sum_{d_1+\ldots+d_r=2(r-1)} \binom{n}{r} \binom{r-2}{d_1-1,\ldots,d_r-1} \left(\prod_{i=1}^{r} d_i!\right) (8d)^r n^{1-r} \le \sum_{d_1+\ldots+d_r=2(r-1)} \left(\prod_{i=1}^{r} d_i\right) (8d)^r n.$$

By convexity, for $(d_1, \ldots, d_r)$ with $d_1 + \ldots + d_r = 2(r-1)$, the product $\prod_{i=1}^{r} d_i$ is maximized when
$d_1 = \ldots = d_r$, and so $\prod_{i=1}^{r} d_i \le 2^r$. The number of ways to choose positive integers $(d_1, \ldots, d_r)$ so
that $d_1 + \ldots + d_r = 2(r-1)$ is $\binom{2r-3}{r-1}$, which is less than $2^{2r}$. Hence, the expected number of $r$-trees
induced by a random formula $F$ is at most $n \cdot 2^{2r} \cdot 2^r \cdot (8d)^r \le n(100d)^r$. ■
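The two counting facts used in this step, the multinomial count of labeled trees with a given degree sequence and the number $\binom{2r-3}{r-1}$ of admissible degree sequences, can be verified for small $r$ by enumerating Prüfer codes. This is a sketch with illustrative function names; it relies on the standard fact that a vertex appearing $c$ times in the Prüfer code has degree $c + 1$.

```python
from collections import Counter
from itertools import product
from math import comb, factorial

def trees_by_degree_sequence(r):
    """Tally labeled trees on vertices 0..r-1 by degree sequence via Pruefer codes."""
    tally = Counter()
    for code in product(range(r), repeat=r - 2):
        counts = Counter(code)
        tally[tuple(counts[v] + 1 for v in range(r))] += 1
    return tally

def multinomial_tree_count(r, degrees):
    """(r-2)! / prod (d_i - 1)!, the number of trees with the given degrees."""
    out = factorial(r - 2)
    for d in degrees:
        out //= factorial(d - 1)
    return out

def num_degree_sequences(r):
    """Number of positive integer sequences (d_1..d_r) summing to 2(r-1)."""
    return comb(2 * r - 3, r - 1)
```

For $r = 5$ the enumeration visits $5^3 = 125$ trees spread over 35 degree sequences, in agreement with both formulas.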

Proof of Lemma [4]. Fix a tree $T$ in $F' = F_1 \cup J$ on $r$ vertices (recall that $J$ is the set of clauses
that contain at most one variable from TC), and consider the set of clauses that have at least two
variables in TC, whose coins we now toss (we use $K$ to denote the set of clauses that were
chosen among the latter).

Assume w.l.o.g. that the assignment $\psi$, w.r.t. which TC is defined, is the all-TRUE assignment.
Look at a variable $x \in V(T)$. We call a clause $(x \vee \bar{z}_1 \vee \bar{z}_2)$, $(\bar{x} \vee \bar{z}_3 \vee \bar{z}_4)$, where the $z_j$'s are some
variables in TC, a type 1, respectively type 2, clause. If clauses of both types appear in $K$ then $x$
surely belongs to $S$ (and therefore $V(T) \cap S \ne \emptyset$). We call $x \notin TC$ elusive if at least one of the two
types of clauses didn't appear in $K$.

Set $\rho = 1 - e^{-\Theta(n^2 p)}$; since $|TC| \ge \rho n$ there are at least $\binom{\rho n}{2} \ge (\rho n)^2/3$ clauses of type 1. We
assume that $F_1$ is bounded and hence every variable appears in at most $n$ clauses, therefore at most
$n$ clauses of type 1 have been included in $F_1$. An identical argument applies to clauses of type 2.
Note also that the clauses of $J$ cannot belong to any of the types. Therefore the probability that
no clause of type 1 belongs to $K$ is at most $(1 - p_2)^{(\rho n)^2/3 - n} \le e^{-d/7}$ (here we use: $d = n^2 p$ is large,
$p_2 = (p - p_1)/(1 - p_1)$, $p_1 = p/2$). The same is true by symmetry for clauses of type 2. Let $E_x$ be
the event that $x$ is elusive, and let $P_i$ be the event that no clause of type $i$ for $x$ appeared, $i = 1, 2$
(namely, $E_x = P_1 \vee P_2$).

$$\Pr[E_x] = \Pr[P_1 \vee P_2] \le \Pr[P_1] + \Pr[P_2] \le 2e^{-d/7} \le e^{-d/8}.$$
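The constants in this step can be checked by hand. The following sketch writes $N = (\rho n)^2/3 - n$ for the number of type 1 clauses still available in the second round and assumes the intermediate exponent is $d/7$; the symbol $N$ is introduced here only for the calculation.

```latex
\begin{align*}
(1-p_1)(1-p_2) &= (1-p_1) - (p-p_1) = 1-p, \\
p_2 &= \frac{p/2}{1-p/2} \;\ge\; \frac{p}{2}, \\
(1-p_2)^{N} &\le e^{-p_2 N} \le e^{-pN/2}
   = \exp\!\left(-\frac{d\rho^{2}}{6} + \frac{pn}{2}\right), \\
2e^{-d/7} \le e^{-d/8} &\iff d \ge 56\ln 2 \approx 38.8 .
\end{align*}
```

The first line confirms that exposing a clause with probability $p_1$ and then, independently, with probability $p_2$ includes it with total probability $p$. In the third line $pn/2 = d/(2n)$ is negligible, and $d\rho^2/6$ exceeds $d/7$ once $\rho$ is close enough to 1.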

Further observe that for $x \ne y$ the events $E_x$ and $E_y$ are independent as they involve disjoint sets
of clauses (each variable supports its own set of clauses). Recall the events $A$, $B$ which were defined
above. In this terminology we just upper bounded the probability of $B$ given $A$, and therefore the
following is true:

$$\Pr[B|A] \le \left(e^{-d/8}\right)^r = e^{-dr/8}.$$

Since our upper bound on $\Pr[B|A]$ only depends on the fact that $|V(T)| = r$, then also

$$q = \max_{T:|V(T)|=r} \Pr[B|A] \le e^{-dr/8}.$$



6 k-Colorability

In this section we discuss, at a high level, how one can obtain results similar to the ones
we have for $k$-SAT for the random graph process (of $k$-colorability). Before we start our discussion
let us recall the algorithm due to Alon and Kahale for coloring $k$-colorable graphs [3]. The first step
of the algorithm is a spectral step; specifically a $k$-coloring of the graph (not necessarily proper)
is obtained by looking at some eigenvectors of the graph (that hopefully reflect in some sense a
proper $k$-coloring). Then, this initial $k$-coloring is refined using a series of combinatorial steps (very
similar to our Steps 2-4 in Algorithm SAT), until possibly a proper $k$-coloring is reached (or the
algorithm fails). The algorithm was analyzed on graphs drawn from the planted distribution first
defined in [19] (the distribution is defined by the following procedure: partition the vertex set into
$k$ color classes of size $n/k$ each, $V_1, V_2, \ldots, V_k$; next, include every $V_i$-$V_j$ edge with probability
$p$). The algorithm was shown to find whp a proper $k$-coloring of the graph when $np \ge ck^2$, $c$ some
sufficiently large constant.
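The planted distribution described in parentheses above is easy to sample. A minimal sketch, assuming that $k$ divides $n$ and that the color classes are chosen by a uniformly random balanced partition; the function name `planted_k_colorable` is illustrative and not from the paper or from [19].

```python
import random

def planted_k_colorable(n, k, p, seed=None):
    """Sample (coloring, edges) from the planted k-colorable distribution.

    Vertices 0..n-1 are split into k classes of size n // k each, and every
    pair of vertices in different classes becomes an edge with probability p.
    """
    if n % k != 0:
        raise ValueError("this sketch assumes that k divides n")
    rng = random.Random(seed)
    vertices = list(range(n))
    rng.shuffle(vertices)
    size = n // k
    color = {v: i // size for i, v in enumerate(vertices)}
    edges = [
        (u, v)
        for u in range(n)
        for v in range(u + 1, n)
        if color[u] != color[v] and rng.random() < p
    ]
    return color, edges
```

By construction the planted coloring is proper for every sampled graph, so the output is always $k$-colorable; with $p = 1$ the sample is the complete balanced $k$-partite graph.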

It is possible to prove that the algorithm works also for graphs drawn from our distribution for 
the same edge density (maybe the constant c is different). The main challenge is to reprove the 
spectral properties of the graph. The basic idea is to notice that whp every $k$-coloring has all of its
color classes of linear size, and also to prove discrepancy properties (similar to, yet more elaborate
than, Proposition [1]). Another crucial ingredient in the proof is establishing a similar notion of a core
(Definition ED. 

Unfortunately, at this point we are still unable to answer a seemingly much simpler question:
how many edges will such a graph typically contain by the end of the process? We expect the
answer to be about $\binom{k}{2} \left(\frac{n}{k}\right)^2$, which would correspond to the case where a unique final $k$-coloring
is nearly balanced.
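The conjectured count is the number of edges of the complete $k$-partite graph with equal parts, and it simplifies as follows (a routine calculation, assuming $k$ divides $n$):

```latex
\[
\binom{k}{2}\left(\frac{n}{k}\right)^{2}
  = \frac{k(k-1)}{2}\cdot\frac{n^{2}}{k^{2}}
  = \left(1-\frac{1}{k}\right)\frac{n^{2}}{2}.
\]
```

For example, $k = 3$ gives $n^2/3$ edges, compared with roughly $n^2/2$ pairs of vertices in total.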

7 Discussion 

As we already mentioned, only a vanishing proportion of $k$-CNFs with $m$ clauses over $n$ variables
are satisfiable when $m/n$ is above the threshold. In recent years, several papers studied different
distributions over satisfiable 3CNF formulas in the above-threshold regime, more precisely some
sufficiently large constant factor above the threshold. In particular, [11] considered the planted
3SAT distribution, and [9] addressed the planted and uniform distributions, both papers developing
new analytical and algorithmic techniques. Our work joins this line of research by studying a new
distribution over satisfiable 3CNF formulas, and once again introducing new analytical ideas to face
the intricacies of this distribution. Furthermore, one interesting conclusion emerges from combining [11], [9] and
our result. In all three distributions the instances show basically the same uni-cluster structure of
the solution space, and the same algorithm solves them all. This gives rise to the following question:
does forcing (in some "natural" way) the unlikely event of being satisfiable in the above-threshold
regime generally result in the structure suggested by Theorem [1] (for clause-variable ratio greater
than some sufficiently large constant)? This question has been answered positively for the planted
and uniform distributions, and in this paper for the random satisfiable 3CNF process.